Patent Application

The present invention relates to the subject matter claimed
and hence refers to a method and a device for compiling programs
for a reconfigurable device.

Reconfigurable devices are well known. They include systolic
arrays, neuronal networks, multiprocessor systems, processors
comprising a plurality of ALUs and/or logic cells, crossbar
switches, as well as FPGAs, DPGAs, XPUTERs, etc. Reference is
being made to DE 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483
A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1,
DE 198 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1,
PCT/DE 00/01869, DE 100 36 627 A1, DE 100 28 397 A1,
DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516,
EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1,
DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, the
full disclosure of which is incorporated herein for purposes
of reference.



Furthermore, reference is being made to devices and methods as
known from US PS 6,311,200; US PS 6,298,472; US PS 6,288,566;
US PS 6,282,627; US PS 6,243,808 issued to Chameleon Systems
Inc., USA, noting that the disclosure of the present application
is pertinent in at least some aspects to some of the devices

The invention will now be described by the following papers
which are part of the present application.

1 Introduction

This document describes the PACT Vectorizing C Compiler XPP-VC which maps a C subset extended
by port access functions to PACT's Native Mapping Language NML. A future extension of this compiler
for a host-XPP hybrid system is described in Section 7.3.

XPP-VC uses the public domain SUIF compiler system. For installation instructions on both SUIF and
XPP-VC, refer to the separately available installation notes.



2 General Approach 

The XPP-VC implementation is based on the public domain SUIF compiler framework (cf.
http://suif.stanford.edu). SUIF was chosen because it is easily extensible.

SUIF was extended with two passes: partition and nmlgen. The first pass, partition, tests if
the program complies with the restrictions of the compiler (cf. Section 3.1) and performs a dependence
analysis. It determines if a FOR-loop can be vectorized and annotates the syntax tree accordingly. In
XPP-VC, vectorization means that loop iterations are overlapped and executed in a pipelined, parallel
fashion. This technique is based on the Pipeline Vectorization method developed for reconfigurable
architectures.^1 partition also completely unrolls inner program FOR-loops which are annotated by
the user. All innermost loops (after unrolling) which can be vectorized are selected and annotated for
pipeline synthesis.

nmlgen generates a control/dataflow graph for the program as follows. First, program data is allocated
on the XPP Core. By default, nmlgen maps each program array to internal RAM blocks while scalar
variables are stored in registers within the PAEs. If instructed by a pragma directive (cf. Section 3.2.2),
arrays are mapped to external RAM. If it is large enough, an external RAM can hold several arrays.

Next, one ALU is allocated for each operator in the program (after loop unrolling, if applicable). The
ALUs are connected according to the data-flow of the program. This data-driven execution of the
operators automatically yields some instruction-level parallelism within a basic block of the program, but
the basic blocks are normally executed in their original, sequential order, controlled by event signals.
However, for generating more efficient XPP Core configurations, nmlgen generates pipelined operator
networks for inner program loops which have been annotated for vectorization by partition. In
other words, subsequent loop iterations are started before previous iterations have finished. Data packets
flow continuously through the operator pipelines. By applying pipeline balancing techniques, maximum
throughput is achieved. For many programs, additional performance gains are achieved by the complete
loop unrolling transformation. Though unrolled loops require more XPP resources because individual
PAEs are allocated for each loop iteration, they yield more parallelism and better exploitation of the XPP
Core.
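
The gain from overlapping loop iterations can be estimated with a simple idealized model. The following sketch is not part of XPP-VC: it assumes an operator pipeline of depth stages that accepts one new iteration per cycle and never stalls, and both function names are hypothetical.

```c
/* Cycles for n iterations when each iteration must finish
   before the next one starts (no overlap). */
unsigned long sequential_cycles(unsigned long n, unsigned long depth)
{
    return n * depth;
}

/* Cycles for n iterations in an ideal pipeline: the pipeline is
   filled once, then one iteration completes per cycle. */
unsigned long pipelined_cycles(unsigned long n, unsigned long depth)
{
    if (n == 0)
        return 0;
    return depth + (n - 1);
}
```

Under these assumptions, a loop with 256 iterations and a pipeline depth of 5 needs 1280 cycles without overlap and 260 cycles with overlap. Real configurations may deviate from this, for example when RAM accesses or unbalanced paths stall the pipeline.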

Finally, nmlgen outputs a self-contained NML file containing a module which implements the program
on an XPP Core. The XPP IP parameters for the generated NML file are read from a configuration file,
cf. Section 4. Thus the parameters can be easily changed. Obviously, large programs may produce NML
files which cannot be placed and routed on a given XPP Core. Later XPP-VC releases will perform a
temporal partitioning of C programs in order to overcome this limitation, cf. Section 7.1.

^1 Cf. M. Weinhardt and W. Luk: Pipeline Vectorization, IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, Feb. 2001, pp. 234-248.

3 Language Coverage

This section describes which C files can currently be handled by XPP-VC.

3.1 Restrictions 

3.1.1 XPP Restrictions

The following C language operations cannot be mapped to an XPP Core at all. They are not allowed in
XPP-VC programs and need to be mapped to the host processor in a codesign compiler, cf. Section 7.3.

• Operating system calls, including I/O

• Division, modulo, non-constant shift and floating point operations (unless the XPP Core's ALU
supports them)^2

• The size of arrays mapped to internal RAMs is limited by the number and size of internal RAM
blocks.

3.1.2 XPP-VC Compiler Restrictions

The current XPP-VC implementation necessitates the following restrictions:

1. No multi-dimensional constant arrays (due to the SUIF version currently used)

2. No switch/case statements

3. No struct data types

4. No function calls except the XPP port and pragma functions defined in Section 3.2.1. The program
must only have one function (main).

5. No pointer operations

6. No library calls or recursive calls

7. No irregular control flow (break, continue, goto, label)

Additionally, there are currently some implementation-dependent restrictions for vectorized loops, cf.
the Release Notes. The compiler produces an explanatory message if an inner loop cannot be pipelined
despite the absence of dependences. However, for many of these cases, simple workarounds by minor
program changes are available. Furthermore, programs which are too large for one configuration cannot
be handled. They should be split into several configurations and sequenced onto the XPP Core, using
NML's reconfiguration commands. This will be performed automatically in later releases by temporal
partitioning, cf. Section 7.1.
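
As an illustration of a "minor program change", the following sketch removes a break statement (restriction 7) by guarding the update with a condition, so that the loop has regular control flow. It is written as ordinary host C for testing: the helper functions and the data array are hypothetical, an actual XPP-VC program would have to contain the loop directly in main, and whether a particular rewrite is accepted depends on the compiler release.

```c
#define N 8
int data[N] = { 3, 7, 1, -4, 9, -2, 5, 6 };

/* Uses break: irregular control flow, excluded by the restricted subset. */
int first_negative_with_break(void)
{
    int i, pos = N;
    for (i = 0; i < N; i++) {
        if (data[i] < 0) {
            pos = i;
            break;
        }
    }
    return pos;
}

/* Same result without break: the loop always runs N times and a
   condition guards the update of pos. */
int first_negative_regular(void)
{
    int i, pos = N;
    for (i = 0; i < N; i++) {
        if (pos == N && data[i] < 0)
            pos = i;
    }
    return pos;
}
```

Both functions return the index of the first negative element, or N if there is none. The second form does more work per call but has a fixed iteration count.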

^2 In future XPP-VC releases, an alternative, sequential implementation of these operations by NML macros will be available.

3.2 XPP-VC C Language Extensions

We now describe the language extensions used by XPP-VC. In order to use these extensions, the C program
must contain the following line:

#include "XPP.h"

This header file, XPP.h, defines the port functions described below as well as the pragma function
XPP_unroll(). If XPP_unroll() directly precedes a FOR loop, it will be completely unrolled
by partition, cf. Section 6.2.

3.2.1 XPP Port Functions

Since the normal C I/O functions cannot be used on an XPP Core, a method to access the XPP I/O units
in port mode is provided. XPP.h contains the definition of the following two functions:

XPP_getstream(int ionum, int portnum, int *value)

XPP_putstream(int ionum, int portnum, int value)

ionum refers to an I/O unit (1..4), and portnum to the port used in this I/O unit (0 or 1). For the
duration of the execution of a program, an I/O unit may only be used either for port accesses or for
RAM accesses (see below). If an I/O unit is used in port mode, each portnum can only be used either
for read or for write accesses during the entire program execution. In the access functions, value is
the data received from or written to the stream. Note that XPP_getstream can currently only read
values into scalar variables (not directly into array elements!), whereas XPP_putstream can handle
any expressions. An example program using these functions is presented in Section 6.1.

3.2.2 pragma Directives

Arrays can be allocated to external memory by a compiler directive:

#pragma extern <var> <RAM_number>

Example: #pragma extern x 1 maps array x to external memory bank 1.
Note the following:

• <var> must be defined before it is used in the pragma.

• Bank <RAM_number> must be declared in the file xppvc_options, cf. Section 4.

• If two arrays are allocated to the same external RAM bank, they are arranged in the order of
appearance of their respective pragma directives. The resulting offsets are recorded in file.itf, cf.
Section 5.1.
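
The ordering rule for arrays sharing one bank can be sketched as follows. This is only an illustration and not the compiler's actual allocation code: it assumes that arrays are packed back to back without gaps, which the text does not state, and the function name assign_offsets is hypothetical.

```c
/* Assign word offsets to arrays placed in one external RAM bank,
   in the order in which their pragma directives appear.
   Assumption: arrays are packed contiguously starting at offset 0.
   Returns the number of words used, or -1 if the bank is too small. */
long assign_offsets(const long sizes[], long offsets[], int count, long bank_words)
{
    long next = 0;
    int i;
    for (i = 0; i < count; i++) {
        if (sizes[i] > bank_words - next)
            return -1;
        offsets[i] = next;
        next += sizes[i];
    }
    return next;
}
```

With two arrays of 256 and 100 words mapped to the same bank, this model places the first at offset 0 and the second at offset 256. The offsets actually chosen by the compiler should always be taken from the interface description file.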
4 Directories and Files

After correct installation, the XPPC_ROOT environment variable is defined, and the PATH variable
extended. $XPPC_ROOT is the XPP-VC root directory. $XPPC_ROOT/bin contains all binary files
and the scripts xppvcmake and xppgcc. $XPPC_ROOT/doc contains this manual and the file
xppvc_releasenotes.txt. XPP.h is located in the include subdirectory.

Finally, $XPPC_ROOT/lib contains the options file xppvc_options. If an options file with the same name
exists in the current working directory or the .xds subdirectory of the user's home directory, they are used
(in this order) instead of the master file in $XPPC_ROOT/lib.



Option            Explanation                                            Default value in xppvc_options
debug             debug output enabled                                   on
version           XPP IP version                                         V2
pacsize           number of ALU-PAEs in x and y direction                6/12
xppsize           number of PACs in x and y direction                    1/1
busnumber         number of data and event buses per row (both dirs)     6/6
iramsize          number of words in one internal RAM                    256
bitwidth          XPP data bit width                                     32
freg_data_port    number of FREG data ports                              3
breg_data_port    number of BREG data ports                              3
freg_event_port   number of FREG event ports                             4
breg_event_port   number of BREG event ports                             4

Table 1: Options



xppvc_options sets the compiler options listed in Table 1. Most of them define the XPP IP parameters
which are used in the generated NML file. Lines starting with a # character are comment lines.

Additionally, extram followed by four integers declares the external RAM banks used for storing
arrays. At most four external RAMs can be used. Each integer represents the size of the bank declared.
Size zero must be used for banks which do not exist. The master file contains the following line which
declares four 4GB (2^30 words) external banks:

extram 1073741824 1073741824 1073741824 1073741824

Note that, in order to simplify programming, xppvc_options does not have to be changed if an I/O
unit is used for port accesses. However, this memory bank is not available in this case despite being
declared.



5 Using XPP-VC 

5.1 xppvcmake

In order to create an NML file, file.c is compiled with the command xppvcmake file.nml.
xppvcmake file.xbin additionally calls xmap. With xppvcmake, XPP.h is automatically searched
for in directory $XPPC_ROOT/include.

The following output produced by translating the example program streamfir.c in Section 6.1 shows the
programs called by xppvcmake:

$ xppvcmake streamfir.nml
pscc -I/home/wema/xppc/include -parallel -no PORKY_FORWARD_PROP4
     -.spr streamfir.c
porky -dead-code streamfir.spr streamfir.spr2
partition streamfir.spr2 streamfir.svo
Program analysis:
   main: DO-LOOP, line 9 can be synthesized
   main: can be synthesized completely
Program partitioning:
   Entire program selected for XPU module synthesis.
   main: DO-LOOP, line 9 selected for synthesis
porky -const-prop -scalarize -copy-prop -dead-code streamfir.svo
     streamfir.svo1
predep -normalize streamfir.svo1 streamfir.svo2
porky -ivar -know-bounds -fold streamfir.svo2 streamfir.sur
nmlgen streamfir.sur streamfir.xco

pscc is the SUIF frontend which translates streamfir.c into the SUIF intermediate representation, and
porky performs some standard optimizations. Next, partition analyzes the program. The output
indicates that the entire program can and will be mapped to NML. Then, porky and predep perform
some additional optimizations before nmlgen actually generates the file streamfir.nml. The SUIF file
streamfir.xco is generated to inspect and debug the result of code transformations.^3 In the generated NML
file, only the I/O ports are placed. All other objects are placed automatically by xmap. Cf. Section 6.1
for an example of the xsim program using the I/O ports corresponding to the stream functions used in
the program.

For an input file file.c, nmlgen also creates an interface description file file.itf in the working directory. It
shows the array to RAM mapping chosen by the compiler. In the debug subdirectory (which is created),
files file.part_dbg and file.nmlgen_dbg are generated. They contain more detailed debugging information
created by partition and nmlgen respectively. The files file_first.dot and file_final.dot created in the
debug directory can be viewed with the dotty graph layout tool. They contain graphical representations
of the original and the transformed and optimized version of the generated control/dataflow graph.

5.2 xppgcc

This command is provided for comparing simulation results obtained with xppvcmake, xmap and
xsim (or from execution on actual XPP hardware) with a "direct" compilation of the C program
with gcc on the host. xppgcc compiles the input program with gcc and binds it with predefined
XPP_getstream and XPP_putstream functions. They read or write files port<n>_<m>.dat in the
current directory for n in 1..4 and m in 0..1. For instance, the program in Section 6.1 is compiled as
follows:

xppgcc -o streamfir streamfir.c

The resulting program streamfir will read input data from port1_0.dat and write its results to port4_0.dat.^4

^3 In an extended codesign compiler, the .xco file would also be used to generate the host partition of the program.
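
To make the host-side behaviour concrete, the following is a minimal sketch of what file-backed port functions could look like. It is not the code shipped with xppgcc. The int return value (1 on success, 0 on failure or end of data), the file format of one decimal integer per line, and the helper names are all assumptions made for this illustration.

```c
#include <stdio.h>

/* Open files, indexed by ionum (1..4) and portnum (0..1). */
static FILE *in_files[5][2];
static FILE *out_files[5][2];

static int valid_port(int ionum, int portnum)
{
    return ionum >= 1 && ionum <= 4 && (portnum == 0 || portnum == 1);
}

/* Open port<ionum>_<portnum>.dat on first use and remember the handle. */
static FILE *open_port(FILE **slot, int ionum, int portnum, const char *mode)
{
    if (*slot == NULL) {
        char name[32];
        sprintf(name, "port%d_%d.dat", ionum, portnum);
        *slot = fopen(name, mode);
    }
    return *slot;
}

/* Read the next value of the input stream; returns 0 when no value is left. */
int XPP_getstream(int ionum, int portnum, int *value)
{
    FILE *f;
    if (!valid_port(ionum, portnum))
        return 0;
    f = open_port(&in_files[ionum][portnum], ionum, portnum, "r");
    if (f == NULL || fscanf(f, "%d", value) != 1)
        return 0;
    return 1;
}

/* Append one value to the output stream. */
int XPP_putstream(int ionum, int portnum, int value)
{
    FILE *f;
    if (!valid_port(ionum, portnum))
        return 0;
    f = open_port(&out_files[ionum][portnum], ionum, portnum, "w");
    if (f == NULL)
        return 0;
    fprintf(f, "%d\n", value);
    fflush(f);
    return 1;
}
```

A program using an infinite loop, such as the one in Section 6.1, would additionally need some way to terminate once the input file is exhausted. How the predefined functions handle this is not described here.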

6 Examples 
6.1 Stream Access

The following program streamfir.c is a small example showing the usage of the XPP_getstream and
XPP_putstream functions. The infinite WHILE-loop implements a small FIR filter which reads input
values from port 1_0 and writes output values to port 4_0. The variables xd, xdd and xddd are used to
store delayed input values. The compiler automatically generates a shift-register-like configuration for
these variables. Since no operator dependences exist in the loop, the loop iterations overlap automatically,
leading to a pipelined FIR filter execution.



 1  #include "XPP.h"
 2
 3  main() {
 4    int x, xd, xdd, xddd;
 5
 6    x = 0;
 7    xd = 0;
 8    xdd = 0;
 9    while (1) {
10      xddd = xdd;
11      xdd = xd;
12      xd = x;
13      XPP_getstream(1, 0, &x);
14      XPP_putstream(4, 0, (2*x + 6*xd + 6*xdd + 2*xddd) >> 4);
15    }
16  }

After generating streamfir.xbin with the command xppvcmake streamfir.xbin, the following
command reads the input file port1_0.dat and writes the simulation results to xpp_port4_0.dat.

xsim -run 2000 -in1_0 port1_0.dat -out4_0 xpp_port4_0.dat
     streamfir.xbin > /dev/null

xpp_port4_0.dat can now be compared with port4_0.dat generated by compiling the program with
xppgcc and running it with the same port1_0.dat.
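
The filter's behaviour can also be checked on plain arrays, without any port files. The sketch below restates the loop body of the listing as a host function. The function name stream_fir and the use of arrays in place of ports 1_0 and 4_0 are assumptions made for this illustration.

```c
/* Apply the filter of the stream example to n input samples.
   in[] and out[] stand in for the streams on ports 1_0 and 4_0. */
void stream_fir(const int in[], int out[], int n)
{
    int x = 0, xd = 0, xdd = 0, xddd;
    int i;
    for (i = 0; i < n; i++) {
        xddd = xdd;        /* shift the delay line by one sample */
        xdd = xd;
        xd = x;
        x = in[i];         /* replaces XPP_getstream(1, 0, &x) */
        out[i] = (2*x + 6*xd + 6*xdd + 2*xddd) >> 4;   /* replaces XPP_putstream */
    }
}
```

For a constant input of 16, the outputs are 2, 8, 14, 16, 16: the delay line fills during the first three samples, after which the filter passes the constant through unchanged because the coefficients sum to 16.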

^4 However, programs receiving initial data from or writing result data to external RAMs in xsim cannot be compared to
directly compiled programs using xppgcc. The results may also differ if a bitwidth other than 32 is used for the generated
NML files.
6.2 Array Access

The following program arrayfir.c is an FIR filter operating on arrays. The first FOR-loop reads input data
from port 1_0 into array x, the second loop filters x and writes the filtered data into array y, and the third
loop outputs y on port 4_0.

 1  #include "XPP.h"
 2  #define N 256
 3  int x[N], y[N];
 4  const int c[4] = { 2, 4, 4, 2 };
 5  main() {
 6    int i, j, tmp;
 7    for (i = 0; i < N; i++) {
 8      XPP_getstream(1, 0, &tmp);
 9      x[i] = tmp;
10    }
11    for (i = 0; i < N-3; i++) {
12      tmp = 0;
13      XPP_unroll();
14      for (j = 0; j < 4; j++) {
15        tmp += c[j]*x[i+3-j];
16      }
17      y[i+2] = tmp;
18    }
19    for (i = 0; i < N-3; i++)
20      XPP_putstream(4, 0, y[i+2]);
21  }



xppvcmake produces the following output:

pscc -I/home/wema/xppc/include -parallel -no PORKY_FORWARD_PROP4 [...]
partition arrayfir.spr2 arrayfir.svo
Program analysis:
   main: FOR-LOOP [...] can be synthesized/vectorized
   [...]
   main: can be synthesized completely
Program partitioning:
   Entire program selected for XPP module synthesis.
   main: FOR-LOOP i, line 7 selected for pipeline synthesis
   main: FOR-LOOP i, line 11 selected for pipeline synthesis
   main: FOR-LOOP i, line 19 selected for pipeline synthesis
[...] unrolling loop j
porky -const-prop -scalarize -copy-prop -dead-code arrayfir.svo
     arrayfir.svo1
predep -normalize arrayfir.svo1 arrayfir.svo2
porky -ivar -know-bounds -fold arrayfir.svo2 arrayfir.sur
nmlgen arrayfir.sur arrayfir.xco

The messages from partition show that all loops can be vectorized. The dependence analysis did not
find any loop-carried dependences preventing vectorization. The inner loop in the middle of the program
is unrolled. The outer loop's body is effectively substituted by the following statement:

y[i+2] = c[0]*x[i+3] + c[1]*x[i+2] + c[2]*x[i+1] + c[3]*x[i];

Since all remaining loops are innermost loops, they are selected for pipeline synthesis. Array reads,
computations, and array writes overlap. To reduce the number of array accesses, the compiler
automatically removes redundant array reads. In the middle loop, only x[i+3] is read. For x[i+2], x[i+1]
and x[i], delayed versions of x[i+3] are used, forming a shift-register. Therefore, each loop iteration
needs only one cycle since one read from x, all computations, and one write to y can be executed
concurrently.
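
The effect of removing redundant array reads can be reproduced in ordinary C. The following sketch compares the loop as written in the program with a hand-written shift-register form that reads x only once per iteration. It illustrates the idea and is not the code or the operator network that nmlgen generates. The function names are hypothetical, and N is reduced to 16 to keep the example small.

```c
#define N 16
static const int c[4] = { 2, 4, 4, 2 };

/* Direct form: four reads of x per iteration, as in the source loop. */
void fir_direct(const int x[N], int y[N])
{
    int i, j, tmp;
    for (i = 0; i < N - 3; i++) {
        tmp = 0;
        for (j = 0; j < 4; j++)
            tmp += c[j] * x[i + 3 - j];
        y[i + 2] = tmp;
    }
}

/* Shift-register form: one new read of x per iteration. The three
   older samples x[i], x[i+1], x[i+2] are kept in scalars. */
void fir_shift_register(const int x[N], int y[N])
{
    int i;
    int x0 = x[0], x1 = x[1], x2 = x[2], x3;   /* preload the delay line */
    for (i = 0; i < N - 3; i++) {
        x3 = x[i + 3];                          /* the only array read */
        y[i + 2] = c[0]*x3 + c[1]*x2 + c[2]*x1 + c[3]*x0;
        x0 = x1;                                /* shift by one sample */
        x1 = x2;
        x2 = x3;
    }
}
```

Both functions write identical values to y[2] through y[N-2]. In the second form the loop body contains a single array read and a single array write, which is what allows one iteration per cycle when the operations are pipelined.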

Finally, the following example program fragment is a 2-D edge detection algorithm.

/* 3x3 horiz. + vert. edge detection in both directions */
for(v=0; v<=VERLEN-3; v++) {
  for(h=0; h<=HORLEN-3; h++) {
    htmp = (p1[v+2][h] - p1[v][h]) +
           (p1[v+2][h+2] - p1[v][h+2]) +
           2 * (p1[v+2][h+1] - p1[v][h+1]);
    if (htmp < 0)
      htmp = - htmp;
    vtmp = (p1[v][h+2] - p1[v][h]) +
           (p1[v+2][h+2] - p1[v+2][h]) +
           2 * (p1[v+1][h+2] - p1[v+1][h]);
    if (vtmp < 0)
      vtmp = - vtmp;
    sum = htmp + vtmp;
    if (sum > 255)
      sum = 255;
    p2[v+1][h+1] = sum;
  }
}

As the output of partition shows, both loops can be vectorized. Since only innermost loops can be
pipelined, the outer loop is executed sequentially. (Note that the line numbers in the program outputs are
not obvious since only a program fragment is shown above.)

partition edge.spr2 edge.svo
Program analysis:
   main: FOR-LOOP h, line 22 can be synthesized/can be vectorized
   main: FOR-LOOP v, line 21 can be synthesized/can be vectorized
   main: can be synthesized completely
Program partitioning:
   Entire program selected for XPP module synthesis.
   main: FOR-LOOP h, line 22 selected for pipeline synthesis
   main: FOR-LOOP v, line 21 selected for synthesis

Also note the following additional features of this program: Address generators for the 2-D array accesses
are automatically generated, and the array accesses are reduced by generating shift-registers for each of
the three image lines accessed. Furthermore, the conditional statements are implemented using SWAP
(MUX) operators. Thus the streaming of the pipeline is not affected by which branch the conditional
statements take.
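
Why multiplexer-style operators keep the pipeline streaming can be seen by rewriting the conditional statements of the fragment as selections. In the sketch below, both candidate values are always computed and a condition picks one, so every iteration performs the same operations. This is a host C illustration with hypothetical function names and does not show the SWAP operator's actual NML semantics.

```c
/* Branching form, as written in the edge detection fragment. */
int clamp_abs_sum_branch(int htmp, int vtmp)
{
    int sum;
    if (htmp < 0)
        htmp = -htmp;
    if (vtmp < 0)
        vtmp = -vtmp;
    sum = htmp + vtmp;
    if (sum > 255)
        sum = 255;
    return sum;
}

/* A two-way selection: behaves like a multiplexer controlled by cond. */
static int select_int(int cond, int if_true, int if_false)
{
    return cond ? if_true : if_false;
}

/* Select form: no statement is skipped, only the chosen value differs. */
int clamp_abs_sum_select(int htmp, int vtmp)
{
    int h = select_int(htmp < 0, -htmp, htmp);
    int v = select_int(vtmp < 0, -vtmp, vtmp);
    int sum = h + v;
    return select_int(sum > 255, 255, sum);
}
```

Both functions return the same value for all inputs. The select form corresponds to a fixed operator network in which the comparison results steer the data.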




7 Future Compiler Extensions

Apart from removing some of the restrictions of Section 3.1.2, the following extensions are planned for
XPP-VC.

7.1 Temporal Partitioning

By using the pragma function XPP_next_conf(), programs are partitioned into several configurations
which are loaded and executed sequentially on the XPP Core. Specific NML configuration commands are
generated which also exploit XPP's sophisticated configuration and preloading capabilities. Eventually,
the temporal partitions will be determined automatically.

7.2 Program Transformations

For more efficient XPP configuration generation, some program transformations are useful. In addition
to loop unrolling, loop merging, loop distribution and loop tiling will be used to improve loop handling,
i.e. enable more parallelism or better XPP usage.

Furthermore, programs containing more than one function could be handled by inlining function calls.

7.3 Codesign Compiler

This section sketches what an extended C compiler for an architecture consisting of an XPP Core
combined with a host processor might look like. The compiler should map suitable program parts, especially
inner loops, to the XPP Core, and the rest of the program to the host processor. I.e., it is a host/XPP
codesign compiler, and the XPP Core acts as a coprocessor to the host processor.

This compiler's input language is full standard ANSI C. The user uses pragmas to annotate those
program parts that should be executed by the XPP Core (manual partitioning). The compiler checks if the
selected parts can be implemented on the XPP. Program parts containing non-mappable operations must
be executed by the host.

The program parts running on the host processor ("SW"), and the parts running on the PAE
array ("XPP") cooperate using predefined routines (copy_data_to_XPP, copy_data_to_host, start_config(n),
wait_for_coprocessor_finish(n), request_config(n)). For all XPP program parts, XPP configurations are
generated. In the program code, the XPP part n is replaced by request_config(n), start_config(n),
wait_for_coprocessor_finish(n), and the necessary data movements. Since the SUIF compiler contains
a C backend, the altered program (host parts with coprocessor calls) can simply be written back to a C
file and then processed by the native C compiler of the host processor.

Thus the sequential control flow of the C program defines when XPP parts are configured into the XPP 
Core and executed. 
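
The shape of such altered host code can be sketched as follows. Only the routine names are taken from the description above. Their parameter lists and return values, the order of the data movements, and the stub bodies (which imitate the coprocessor with a placeholder computation so that the example can run on a host) are assumptions made for this illustration.

```c
#include <string.h>

#define N 4
static int xpp_in[N], xpp_out[N];   /* stand-ins for XPP-side data */
static int loaded_config = -1;
static int finished = 0;

/* Stub routines. Real versions would talk to the XPP Core. */
void request_config(int n) { loaded_config = n; finished = 0; }

void copy_data_to_XPP(const int *src, int count)
{
    if (count > N)
        count = N;
    memcpy(xpp_in, src, (size_t)count * sizeof(int));
}

void start_config(int n)
{
    int i;
    if (n != loaded_config)
        return;
    for (i = 0; i < N; i++)
        xpp_out[i] = 2 * xpp_in[i];   /* placeholder for the XPP part */
    finished = 1;
}

int wait_for_coprocessor_finish(int n) { return n == loaded_config && finished; }

void copy_data_to_host(int *dst, int count)
{
    if (count > N)
        count = N;
    memcpy(dst, xpp_out, (size_t)count * sizeof(int));
}

/* Host code after XPP part 0 has been replaced by coprocessor calls. */
int run_part0(const int *in, int *out)
{
    request_config(0);
    copy_data_to_XPP(in, N);
    start_config(0);
    if (!wait_for_coprocessor_finish(0))
        return 0;
    copy_data_to_host(out, N);
    return 1;
}
```

In this model, the host decides when part 0 is configured and started simply by the position of the calls in its sequential code.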
Abstract

The eXtreme Processing Platform (XPP) technology offers
a unique reconfigurable computing platform supported
with a set of tools. A C compiler, which integrates both
new and efficient compilation techniques and temporal
partitioning, is presented. Temporal partitioning guarantees
the compilation of programs with unlimited complexity as
long as the supported C-subset is used. A new partitioning
scheme, which permits to map large loops of any kind and is
neither constrained by loop-dependencies nor nested
structures, is also presented. Furthermore, temporal partitioning
is applied to reduce the configuration time overhead and
thus can lead to performance gains. The compilation from
C code to configuration data, ready to be downloaded
onto the XPP, takes seconds for complex examples, which
is, as far as we know, not reproduced by any other
reconfigurable computing technology. The compiler represents
a step forward, by furnishing a truly "push-button"
approach, only comparable to microprocessor domains, and
thus can spread the use of the XPP technology and deal
with time-to-market pressures positively.



1 Introduction

Many of today's applications are characterized by intensive
data-stream processing and high-performance requirements.
It is more and more evident that such performance cannot
be accomplished with today's microprocessor technology.
Conventional processors (including DSPs) are geared for
sequential processing. Multi-DSP and very large instruction
word (VLIW) processors still have severe memory
bottlenecks, lack the number of data ports required to support
multi-channel, high speed data streams, and fail on furnishing
low power consumption solutions. Accelerating specific
functions using application-specific integrated circuits
(ASICs) relieves some of the processing burden, adds some
required features, but limits flexibility and requires expensive
non-recurring engineering (NRE) costs and long design
cycles. High density field-programmable gate arrays (FPGAs)
eliminate the NRE costs, add flexibility, but still require
long timing optimizations and verification cycles and
low level hardware efforts. Additionally, the fine-grained
structure adopted in FPGAs is not suitable to map at the
algorithmic level, which is proved by the well-known
difficulties to have a "push-button" high-level methodology to
program these architectures.

New reconfigurable processing units (RPUs) are being
introduced trying to solve those problems [1]. One of the
new promising architectures is the XPP [2][3]. The XPP
is a coarse-grained, runtime-reconfigurable, 2-D array
parallel structure. The architecture was designed to facilitate
programming and to support pipelining, dataflow computations,
and parallelism from the instruction to the task level
efficiently. Therefore, this technology is well suited for
applications in multimedia, telecommunications, simulation,
digital signal processing, and similar stream-based application
domains. The XPP architecture also supports dynamic
self-reconfiguration in a user transparent way. In order to
drastically reduce the time to program the XPP, and to keep
the user from architecture details, a high-level compiler
integrating temporal partitioning is required. Such a compiler
is the main topic of this paper.

This paper is organized as follows. The next section
introduces briefly the XPP technology. Section 3 outlines
compilation to the XPP and section 4 describes the temporal
partitioning steps. Section 5 shows some experimental
results, section 6 points out the main differences between
this and previous works, and finally section 7 concludes the
paper and enumerates ongoing and future work planned.



2 XPP Technology

The XPP technology consists of a reconfigurable computing
platform delivered as a device or an intellectual property
(IP) core, and a complete development tool suite (XDS)
[2]. An XPP can be used as a coprocessor for CPU and DSP
architectures. A prior version of the technology has resulted
in the XPU128-ES [3], a prototype device, which was
produced in silicon.

The XPP architecture is based on a hierarchical array
of coarse-grain, adaptive computing elements called
Processing Array Elements (PAEs), and a packet-oriented
communication network. The strength of the XPP technology
originates from the combination of array processing with
unique and powerful run-time reconfiguration mechanisms.
Different tasks or applications can be configured and run
independently on different parts of the array. Reconfiguration
is triggered externally or even by special event signals
originating within the array, enabling self-reconfiguring
designs. By utilizing protocols implemented in hardware, data
and event packets are used to process, generate, decompose
and merge streams of data.

2.1 Array Structure

An XPP contains one or several Processing Array Clusters
(PACs), i.e., rectangular blocks of PAEs. Fig. 1 shows
the structure of a typical XPP device. It contains four PACs
(see top left-hand side). Each PAC is attached to a Configuration
Manager (CM) responsible for writing configuration
data into the configurable objects of the PAC using a dedicated
bus. Multi-PAC XPPs contain additional CMs for
configuration data handling, forming a hierarchical tree of
CMs. The root CM is called the supervising CM (SCM).
It has an external interface (dotted arrow originating from
the SCM in Fig. 1) which usually connects the SCM to an
external configuration memory. A CM consists of a state
machine and internal RAM for configuration caching (see
top right-hand side of Fig. 1).

Horizontal busses carry data and events. They can be
segmented by configurable switch-objects, and connected
to PAEs and special I/O objects at the periphery of the device.
The I/O objects can be used for data-streaming or
to access external resources (e.g., memories). A column
of ports to the corresponding leaf CM is located on the
array. A CMPort can be used to send events to the CM
from the array. The typical PAE shown in Fig. 1 (bottom
center) contains three objects: one FREG (forward register),
one BREG (backward register) and one ALU. The
FREG object is used for vertical forward routing (with a
programmable number of register stages), or to perform
MERGE, SWAP or DEMUX operations (for controlled
stream manipulations). The BREG object is used for vertical
backward routing (registered or not), or to perform some
selected arithmetic operations (e.g., ADD, SUB, SHIFT).
The BREGs can also be used to perform logical operations
on events. Each ALU (see its internal structure on the
bottom left-hand side of Fig. 1) performs common two-input
fixed-point arithmetical and logical operations, and
comparisons. A MAC (multiply and accumulate) operation can
be performed using the ALU and the BREG objects of one
PAE in a single clock cycle.

Another standard PAE object is the memory object
which can be used in FIFO mode or as RAM for lookup
tables, intermediate results, etc. If such objects are needed
they are located in the left and/or right columns of PAEs of
each PAC. However, any PAE object functionality can be
included in the XPP architecture.

A set of parameterizable features can be used to furnish
an XPP that best fits to user and application demands.
Those features include: the number of PACs and their PAEs,
number of internal memories, number of I/O ports, number
of buses, word bitwidth, cache size, depth of the FIFO to
configure each object, etc.




Figure 1: XPP architecture.



2.2 Packet Handling and Synchronization

PAE objects as defined above communicate via a packet-oriented
network. Two types of packets are sent through the
array: data and event packets. Data packets have a uniform
bitwidth specific to the XPP core or device.

In normal operation mode, PAE objects are self-synchronizing.
An operation is performed as soon as all
necessary data input packets are available. The results are
forwarded as soon as they are computed and the previous
results have been consumed. Thus, a signal-flow graph can
be mapped directly to the ALU objects and data-streams can
flow through them in a pipelined manner without adding
specific hardware.

Event packets are one bit wide. They transmit state
information which controls ALU execution and packet generation.
For instance, they can be used to control the merging
of data-streams or to deliberately discard data packets.
Thus, conditional computations depending on the results of
earlier ALU operations are feasible. Events can even trigger
a self-reconfiguration of the device as explained below.

Each data or event packet is only forwarded if the previous
one has already been consumed. The communication
system was designed to transmit one packet on each interconnect
per cycle. Hardware protocols ensure that no packets
are lost, even in the case of pipeline stalls or during the
configuration process. This simplifies application development
considerably. No explicit scheduling of operations
is required.

2.3 Configuration

The XPP architecture is optimized for rapid and
user-transparent configuration. For this purpose, the
configuration managers in the CM tree operate independently
(without global synchronization), and therefore are able to
configure their respective parts of the array in parallel.
Every PAE stores locally its current configuration state, i.e.,
if it is part of a configuration or not (states "configured"
or "free"). Once a PAE is configured, it changes its state
to "configured". This prevents the respective CM from
reconfiguring a PAE which is still in use. The CM caches
the configuration data in its internal RAM and constantly
tries to configure the objects used by the next configuration
requested. Each XPP object has a configuration FIFO
which stores data of subsequent configurations. Once an
object has been released (state "free"), the next configuration
word in its FIFO is loaded immediately. Hence it is
possible to reconfigure partially in one clock cycle.
Additionally, a prefetching mechanism is used. While a
configuration is being loaded onto the FIFO of each object, other
configurations may already be requested and cached in the
low-level CMs' internal RAM. Thus, a configuration does not
need to be requested all the way from the SCM down to the
array when objects become available. While loading a
configuration, its PAEs start their part of the computations as
soon as they are in state "configured".
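
The per-object behaviour described above (two states and a
FIFO of pending configuration words) can be sketched in C.
This is a simplified software model, not the device's actual
protocol: the XppObject type, the functions cm_push and
obj_release, and the FIFO depth of 4 are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_DEPTH 4   /* assumed; the real depth is a device parameter */

typedef enum { STATE_FREE, STATE_CONFIGURED } ObjState;

typedef struct {
    ObjState state;
    unsigned current;             /* configuration word in use */
    unsigned fifo[FIFO_DEPTH];    /* words of subsequent configurations */
    int head, count;
} XppObject;

/* Load the next pending word, but only if the object is free. */
static void obj_try_load(XppObject *o)
{
    if (o->state == STATE_FREE && o->count > 0) {
        o->current = o->fifo[o->head];
        o->head = (o->head + 1) % FIFO_DEPTH;
        o->count--;
        o->state = STATE_CONFIGURED;
    }
}

/* CM side: queue a configuration word; false if the FIFO is full. */
bool cm_push(XppObject *o, unsigned word)
{
    if (o->count == FIFO_DEPTH)
        return false;
    o->fifo[(o->head + o->count) % FIFO_DEPTH] = word;
    o->count++;
    obj_try_load(o);              /* a free object is configured at once */
    return true;
}

/* Release event: the object becomes free and immediately takes
 * the next word waiting in its FIFO, if there is one. */
void obj_release(XppObject *o)
{
    assert(o->state == STATE_CONFIGURED);
    o->state = STATE_FREE;
    obj_try_load(o);
}
```

Because a configured object refuses to load, a word pushed
while the object is in use simply waits in the FIFO, which
mirrors how a CM is prevented from overwriting a PAE that
is still part of a running configuration.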

Each ALU object has an input event port that triggers
the self-releasing of its resources and of all of the objects
connected to it. Such an event is successively broadcast
according to the interconnections.

Because of its coarse-grain nature, an XPP device can
be configured rapidly. Since only the configuration of those
array objects actually used is necessary, the configuration
time depends on the application.

2.4 Development Tools

The XPP can be programmed by using the Native Mapping
Language (NML) [2], a PACT proprietary structural
language with reconfiguration primitives. It gives the
programmer direct access to all hardware features. In NML,
configurations consist of modules which are specified as
in a structural hardware description language, similar to,
for instance, structural VHDL. PAE objects are explicitly
allocated, optionally placed, and their connections
specified. Additionally, NML includes statements to support
configuration handling. Thus, configuration handling is
an explicit part of the NML application program. XDS
is an integrated environment for programming with NML.
The main component is the mapper xmap which compiles
NML source files, places and routes the objects, and
generates XPP binary files. xmap uses an enhanced force-based
placer with short runtimes. The XPP binaries can either be
simulated and visualized cycle by cycle with the xsim and
xvis tools, or directly executed on an XPP device. A
high-level compiler, described in the next section, has been added
to XDS and permits mapping C programs onto the XPP.

Reconfiguration and prefetching requests can be issued
by any CM in the tree (including the SCM which can
respond to external requests) and also by event signals
generated in the array itself. Running modules can do a
self-releasing of their resources and request another
configuration. Thus, it is possible to execute an application
consisting of several configurations without any external
control.

The CM of the XPP permits exploiting speculative
configuration¹, i.e., the configuration of a module possibly
used after the current one has finished execution. If the path
which includes that module is taken, the CM only has to
trigger the execution of the configuration (see the section of
the NML code in Fig. 2 and the simulation performed with
xsim in Fig. 3, where conf_MOD2 is speculatively
configured during the execution of conf_MOD0). If this path is
not taken, the CM triggers the releasing of the resources
already configured and requests the other configuration.



3 Compiling C Code with XPP-VC

The XPP Vectorizing C Compiler XPP-VC is based on
the SUIF compiler framework [4]. SUIF is used because
it is easily extensible. The XPP-VC compilation
flow is shown in Fig. 4. An options file, used by the
compiler, specifies the parameters of the targeted XPP and
the external memories connected to the XPP. To access XPP
I/O ports, specific C functions are provided.

¹This has similarities to speculative execution. In this case, before
knowing if a configuration will be requested, its configuration is started.

CONFIG conf_MOD0 {
  CONF_MODULE(MOD0)       // request the configuration of MOD0
  REQUEST(conf_MOD2_spec) // start speculative configuration
  // if (MOD0.CMPort0 == "1") then conf_MOD2_exec is requested
CONF CN«^M{MODO,CMPonO»con04005U»«^ J 
  // if (MOD0.CMPort1 == "0") then conf_MOD1 is requested
GONFjC»fiK>R3XMOD0.CMPgrtl« oonLMODl« ^ 

CONFIG conf_MOD2_spec { // request the configuration of MOD2
  CONF_MODULE(MOD2)     // but do not start it

CONFIG conf_MOD2_exec { // MOD2 is taken
Sne7<M01>2»Start.A » t) // en^le Ibe start of computuiigr of M0D2 
  REQUEST(conf_MOD3)      // request the next configuration

CONFIG conf_MOD1 {        // MOD1 is taken
  REQUEST(conf_MOD2_rec)  // releasing of resources of MOD2
  CONF_MODULE(MOD1)       // request the
  REQUEST(conf_MOD3)      // request the next configuration

RECONf (MODI ^Snn) Rleaae the resoutees of MODI 

) 



Figure 2: Section of NML describing the control flow.



The compiler starts with some architecture-independent
preprocessing passes based on well-known compilation
techniques [5]. During this step, FOR loops are
automatically unrolled if instructed by the programmer. Then the
compiler performs a data-dependence analysis. The
compiler tries to vectorize inner program FOR loops. In
XPP-VC, vectorization means that loop iterations are overlapped
and executed in a pipelined, parallel fashion. This technique
is based on the Pipeline Vectorization method developed for
reconfigurable architectures [6].

The C program can be manually split into several
modules by using annotations. Otherwise, automatic temporal
partitioning can be applied (see Section 4) in order to furnish
mappable modules and to reduce the overall latency.

Figure 3: Speculative configuration (enables earlier
activation of MOD2).

MODGen generates one NML module for each temporal
partition. First, program data is allocated on the XPP. By
default, MODGen maps each program array to internal or
external RAM while scalar variables are stored in registers
within the PAEs. Next, a control/dataflow graph (CDFG) is
generated. Straight-line code without array accesses can be
directly mapped to a data-flow graph since the data
dependences are obvious in the DAG representation. One ALU
is allocated for each operator in the CDFG. Because of the
self-synchronization of operators on the XPP, no explicit
control or scheduling is needed. The same is true for
conditional execution of such blocks. Both branches are executed
in parallel and MUX operators select the correct output (and
discard the other one) depending on the condition. This
data-driven execution of the operators automatically yields
instruction-level parallelism. In contrast, accesses to the
same array have to be controlled explicitly to maintain the
correct execution order. MERGE operators (which select
one input without discarding the other one) route address
and write data packets in the correct order to the RAM, and
DEMUX operators route read data packets to the correct
subsequent operator. State machines for generating the
correct sequence of event signals (to control these operators)
are synthesized by the compiler. For conditional branches
containing array accesses or inner loops, DEMUX
operators controlled by the IF condition route data packets only
to the selected branch, and output values are taken from the
branch activated. Thus, only selected branches receive data
packets and execute.
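
The difference between the two ways of mapping a conditional
can be sketched in C. This is an illustrative software analogy
only: the branch bodies (a + 1 and a * 2), the Activity
counters, and the function names are hypothetical and do not
come from the compiler.

```c
#include <assert.h>

/* Counts how often each (hypothetical) branch operator fired. */
typedef struct { int then_fired, else_fired; } Activity;

static int then_op(int a, Activity *act) { act->then_fired++; return a + 1; }
static int else_op(int a, Activity *act) { act->else_fired++; return a * 2; }

/* MUX style: both branches compute; the condition selects one
 * result and the other result is discarded. */
int cond_mux(int cond, int a, Activity *act)
{
    assert(act != 0);
    int t = then_op(a, act);
    int e = else_op(a, act);
    return cond ? t : e;
}

/* DEMUX style: the condition routes the input packet to one
 * branch only, so the other branch receives no data and does
 * not execute. */
int cond_demux(int cond, int a, Activity *act)
{
    assert(act != 0);
    return cond ? then_op(a, act) : else_op(a, act);
}
```

Both functions return the same value. The MUX form is safe
only when the unselected branch has no side effects, which is
why branches with array accesses or inner loops take the
DEMUX form.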

In loops, all variables updated in the loop body are
handled as follows. The first iteration uses an input packet for
the variable's value, and the subsequent iterations use
packets generated in the previous iteration. In all but the last
iteration, a DEMUX operator routes the outputs of the loop
body back to the body inputs. Only the results of the last
iteration are routed to the loop output by the DEMUX
operators. The control packets for the DEMUX are generated by
the loop counter or the comparator evaluating the exit
condition. Note that the internal operators' outputs cannot just
be connected to subsequent operators since they produce a
result in each loop iteration. The required last packet would
be hidden by a stream of intermediate packets. If array
accesses are present, a loop iteration may only be started after
the previous iteration has terminated because the original
access order must be maintained. This is enforced by event
signals.

Figure 4: XPP-VC compilation flow.

For generating more efficient XPP configurations,
MODGen generates pipelined operator networks for inner
program loops which have been annotated for vectorization by
the preprocessing step. In other words, subsequent loop
iterations are started before previous iterations have
finished. Data packets flow continuously through the operator
pipelines. By applying pipeline balancing techniques,
maximum throughput is achieved. For many programs,
additional performance gains are achieved by the complete loop
unrolling transformation. Although unrolled loops usually
require more XPP resources, they yield more parallelism
and better exploitation of the XPP. To reduce the number of
array accesses, the compiler automatically removes
redundant array reads. When array references inside loops access
subsequent element positions, the compiler only uses one
reference and generates delayed structures, forming shift
registers.
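
The removal of redundant array reads can be illustrated at
source level. The sketch below is an analogy written for this
purpose, not compiler output: the three-element window sum,
the function names, and the read counters are assumptions.
It contrasts three reads per iteration with a single read
whose value is then passed along a two-stage delay line.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form: three array reads per output value. */
void sum3_naive(const int *x, int *y, size_t n, size_t *reads)
{
    assert(n >= 3);
    *reads = 0;
    for (size_t i = 0; i + 2 < n; i++) {
        y[i] = x[i] + x[i + 1] + x[i + 2];
        *reads += 3;
    }
}

/* Shift-register form: each element is read from the array
 * once; d2 and d1 hold the two previously read elements. */
void sum3_shift(const int *x, int *y, size_t n, size_t *reads)
{
    assert(n >= 3);
    int d2 = x[0];                /* element i     */
    int d1 = x[1];                /* element i + 1 */
    *reads = 2;
    for (size_t i = 0; i + 2 < n; i++) {
        int cur = x[i + 2];       /* the only read in the loop body */
        (*reads)++;
        y[i] = d2 + d1 + cur;
        d2 = d1;                  /* shift the delay line */
        d1 = cur;
    }
}
```

For n elements the first form performs 3(n - 2) reads and the
second performs n, while both write the same n - 2 results.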



Finally, each module generated by MODGen is placed
and routed automatically by xmap.

The XPP-VC compiler currently supports a C subset
sufficient for programming real applications. Struct data
types, pointer operations, irregular control flow (break,
continue, goto, label), and recursive and operating
system calls are not supported or cannot be mapped to the
XPP.



4 Temporal Partitioning

A program too large to fit in an XPP can be handled by
splitting it into several parts (configurations) such that each
one is mappable. Temporal partitioning permits the
automatic exposing of configurations such that the overall
execution time of the application is minimized and each
configuration is successfully mapped onto the XPP resources. It
considers the costs to load into the cache, to configure and
to execute each configuration with the XPP. An important
strategy that is considered is to pre-fetch configurations
while another is being configured or is running. Arrays of
constants or with pre-defined values used in one or more
configurations can be initialized in parallel with the
execution of the previous configurations. This takes advantage of
the initialization of the array carried out by using the
configuration bus.

The set of partitions resulting from the splitting is then
processed by MODGen, generating a set of configurations.
Next, specific NML configuration commands are generated
which also exploit XPP's sophisticated configuration and
pre-fetching capabilities, and specify the configuration
control flow that is orchestrated by the CM.

4.1 Benefits of Temporal Partitioning

Temporal partitioning targeting the XPP can reduce,
when efficiently applied, the overall execution time. Such
reduction can be mainly achieved by the following:
(1) reducing the complexity of each partition can reduce
the interconnection delays (long interconnections may pass
through registers and thus add clock cycle delays);
(2) reducing the number of references, in the section of the
program related to each partition, using the same resource,
by distributing the overall references among partitions, can
lead to performance gains as well. This happens with the
statements in the program referring to the same
array; (3) reduction of the overall configuration overhead by
overlapping fetching, configuration and execution of
distinct partitions.

Example: Consider the C example max_avg shown
in Fig. 5. Configuration boundaries are represented by
XPP_next_conf() statements. They define four
configurations in the code (see Fig. 6). Apart from exposing temporal
partitions in such a way that the mapping to XPP is
accomplished, combining only the most frequently taken
conditional paths in the same partition can reduce the total
execution time by substantially reducing the reconfiguration
time (since the partitions for the other paths are not
configured when they are not taken). Fig. 6 presents such a case.
If the path bb_0 and bb_1 has been identified as the most
frequently executed, this path can be in the same partition².
In such a case, the configuration related to bb_2 will only
be called when the most frequent path has not been taken.

²The duplication of bb_3 would permit to have a configuration with
{bb_0, bb_1, bb_3} and another one with {bb_2, bb_3}.

// max_avg example

if (op == 1) { // average kernel

  }

  XPP_next_conf();
} else { // max kernel

  max = 0;

    if (x[i] > max) max = x[i];

}



Figure 5: Example with two conditionally executed kernels,
and with configuration boundaries represented.




Figure 6: CFG of the algorithm shown in Fig. 5. Lines
crossing edges represent the XPP_next_conf() statements in
the code. Bubbles containing basic blocks represent the
regions to be implemented in different partitions.

Since configuration takes many clock cycles, it is in most
cases preferable to reuse a configuration as long as
possible in order to reduce the reconfiguration time
overhead. Thus, loops in the source code are always good
candidates to be entirely implemented in a single configuration.

4.2 Partitioning Loops

Each loop that does not fit onto the XPP can be dealt with
by performing loop distribution [5] (if applicable) or by
partitioning the loop and using the CM to orchestrate the control
flow. Currently, loop distribution is not automatically
applied. Instead, we propose a new method to partition
complex loops without restrictions. All the loops whose
bodies must be partitioned are transformed into straight-line
code with a jump to loop-exit or to the next iteration in
order that each partition can be compiled by MODGen. Fig. 7
shows an example of such a transformation without the
statements needed to communicate the value of scalar variables
between configurations. Each configuration requests the
next configuration to be taken (if none is requested then the
application terminates and the last configuration releases its
resources). Depending on the value of the i<N condition,
config. #2 takes two different exits, which request #3 or #4
respectively. Since config. #3 always requests #2 at the end
of its execution, the initial behavior of the loop is preserved.
The temporal partitioning creates two additional
configuration boundaries to preserve the initial functionality. From
Fig. 7b it can be seen that configuration boundaries were
inserted before and after the if statement. These boundaries
are needed since the code before and after will be executed
once and both the if header and body will iterate N+1 and
N times respectively.



for(i=0;i<N;i++) {      label: if(i<N) {      #2

  stmt1;                  stmt1;              #2

  XPP_next_conf();        stmt2;              #3

  stmt2;                  i++;                #3

stmt3;

a)                      b)                    c)



Figure 7: Example of the transformation applied for
partitioning loops. a) original code with the added statement
representing where the loop is partitioned; b) transformed
code; c) configuration ID for each statement in b).
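
The control flow between the configurations of a partitioned
loop can be emulated in plain C. In this sketch, which rests
on assumptions not shown in Fig. 7, each configuration is a
hypothetical function that returns the ID of the configuration
it requests next, run_partitioned stands in for the CM, the
State struct stands in for the scalars that would be passed
between configurations, and counters replace stmt1 to stmt3.
Placing the initialization of i in config. #1 is also an
assumption.

```c
#include <assert.h>

enum { CFG_NONE = 0, CFG_1, CFG_2, CFG_3, CFG_4 };

/* Scalars that must survive reconfiguration in this model. */
typedef struct { int i, n, s1, s2, s3; } State;

/* #1: code before the loop (assumed to hold i = 0). */
static int cfg1(State *s) { s->i = 0; return CFG_2; }

/* #2: if header and stmt1; two exits. */
static int cfg2(State *s)
{
    if (s->i < s->n) { s->s1++; return CFG_3; }
    return CFG_4;
}

/* #3: stmt2 and i++; always requests #2 again. */
static int cfg3(State *s) { s->s2++; s->i++; return CFG_2; }

/* #4: stmt3; requests nothing, so the application ends. */
static int cfg4(State *s) { s->s3++; return CFG_NONE; }

/* Stand-in for the CM: runs whichever configuration was
 * requested. Returns how many times config. #2 was executed. */
int run_partitioned(State *s)
{
    assert(s->n >= 0);
    int header_runs = 0;
    int next = CFG_1;
    while (next != CFG_NONE) {
        switch (next) {
        case CFG_1: next = cfg1(s); break;
        case CFG_2: header_runs++; next = cfg2(s); break;
        case CFG_3: next = cfg3(s); break;
        default:    next = cfg4(s); break;
        }
    }
    return header_runs;
}
```

With n set to N, config. #2 runs N+1 times while stmt1 and
stmt2 each run N times, matching the iteration counts stated
in the text.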

4.3 Automatic Partitioning

From the SUIF representation of the C source code the
temporal partitioning phase constructs an extended
Hierarchical Task Graph, HTG+³. This extended graph has two
types of nodes: (1) behavioral nodes representing lines
of code in the input program; (2) array nodes
representing each array existent in the source code. For instance,
Fig. 8 shows the top level of the HTG+ for an
implementation of the DCT (Discrete Cosine Transform) based on
matrix multiplications. Type (1) nodes have three distinct
sub-types: (a) block nodes representing basic blocks; (b)
compound nodes representing if-then-else structures;
(c) loop nodes representing the loops (for, while). Loop
and compound nodes explicitly embody hierarchical levels.
Edges in the HTG+ represent data communication between
two nodes or just enforce execution precedence.

Each behavioral node of the HTG+ is labeled with the
following information (some of the labelling steps require
estimation efforts): (1) block and compound nodes:
number of ALUs and REGs; (2) loop nodes: number of
iterations (if unbound, profiling can be used), and number of
ALUs and REGs; (3) array nodes: the size of the array,
type of the elements, and, when they do exist, the
initialization values. Each edge between two behavioral nodes of the
HTG+ is labeled with the number of data words that must
be transferred between the two nodes. Each edge between
an array and a behavioral node in the HTG+ is labeled with
the number of load and store references in the source code
represented by the behavioral node to that particular array.
The estimated number of times that each load and store
reference will be executed is also collected. The use of the
same array by different behavioral nodes increases the
execution latency and the number of resources needed for this
partition⁴.

³The model has been chosen because it also exposes loop and task
level parallelism.

TempPart uses three types of estimations: (1) number
of XPP resource units needed by the configuration
implementing a single or a set of behavior nodes; (2) latency for
a behavior node or a set of connected behavior nodes on the
HTG+ (this does not need to be accurate to the real
execution time and only needs to have relative accuracy); (3)
number of clock cycles to fetch and configure each
partition (calculated based on the number of configuration words
needed, which is computed with the estimation of the
resources needed directly from the SUIF representation or
with the number of edges, ALUs, REGs, and pre-defined
values existent in the NML graph generated by MODGen).




Figure 8: Top level of the HTG+ for the DCT example (this
top level consists of 4 loops). Circles and boxes represent
behavioral and array nodes respectively. Data is read from
an input port (Loop1) and written to an output port (Loop4).



⁴E.g., twice the number of references to the same RAM leads to more
than twice the number of objects required on XPP and delays each access
because of the objects needed to MERGE and DEMUX data and address
packets. Hence, combining several behavioral nodes in one partition incurs
an overhead which is computed during the temporal partitioning algorithm.



The temporal partitioning algorithm starts with a
partition for each node on the top of the HTG+ and then
iteratively merges adjacent partitions until no performance gains
are achieved, considering the maximum available size for
each partition. Each partition must currently define, on the
control flow graph (CFG) of the program, regions of code
with all entries to the same instruction and possibly multiple
exits. The algorithm considers the overlapping of
configuration and execution with fetch during the merging of
partitions. The algorithm starts with the granularity of the nodes
in the HTG+ and only if a block node cannot be mapped
it considers partitioning at the statement or sub-block level.
Thus, the granularity of the algorithm adapts according to
the application needs.
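
The merging loop can be sketched in C under a deliberately
simple cost model. Everything in this sketch is an assumption
made for illustration and is not the cost function of
TempPart: partitions form a linear sequence, removing one
configuration boundary saves a constant number of cycles
(reconf), and merging partitions k and k+1 costs penalty[k]
extra cycles, for example due to shared-array overhead.

```c
#include <assert.h>

#define MAX_PARTS 16

typedef struct {
    int n;                    /* current number of partitions */
    int size[MAX_PARTS];      /* estimated resource units per partition */
    int penalty[MAX_PARTS];   /* extra cycles if k and k+1 are merged */
} Partitioning;

/* Greedily merge adjacent partitions while that saves cycles and
 * the merged partition still fits into max_size resource units.
 * Returns the total number of cycles saved under this model. */
int merge_partitions(Partitioning *p, int max_size, int reconf)
{
    assert(p->n >= 1 && p->n <= MAX_PARTS);
    int saved = 0;
    for (;;) {
        int best = -1, best_gain = 0;
        for (int k = 0; k + 1 < p->n; k++) {
            int gain = reconf - p->penalty[k];
            if (p->size[k] + p->size[k + 1] <= max_size &&
                gain > best_gain) {
                best = k;
                best_gain = gain;
            }
        }
        if (best < 0)
            break;                            /* no merge yields a gain */
        p->size[best] += p->size[best + 1];   /* merged partition */
        for (int k = best + 1; k + 1 < p->n; k++) {
            p->size[k] = p->size[k + 1];      /* close the gap */
            p->penalty[k - 1] = p->penalty[k];
        }
        p->n--;
        saved += best_gain;
    }
    return saved;
}
```

The loop stops for either of the two reasons named in the
text: no remaining merge improves performance, or every
beneficial merge would exceed the maximum partition size.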

The temporal partitioning strategy only exploits
configuration boundaries inside loop bodies if an entire loop cannot
be mapped to the XPP or contains more than one inner loop
in the same level of the loop body. If these cases occur, the
algorithm is applied hierarchically to the body of the loop.

[Figure 9 block labels: SUIF representation annotated with
data dependences; HTG+ generation; Temporal Partitioning
Algorithm; XPP parameters; alpha; CheckSize (XMAP).]

Figure 9: Automatic temporal partitioning methodology.

Fig. 9 shows the methodology which uses three levels
(the computational efforts increase from the first to the
third level): (1) Temporal Partitioning algorithm based on
the estimation of the needed resources done with
function costs based on the number and kind of operations in
the source code. The algorithm uses the HTG+ and the
SUIF representation of the program; (2) For each
configuration selected in the first level, the estimated sizes are
checked with the ones estimated by generating the NML
graph with MODGen. If the size surpasses the available
resources, the algorithm returns to level (1), relaxing the size
constraint (diminishing the maximum number of
available resources); (3) Check if each configuration
successfully checked in level (2) can be really mapped to the XPP.
This level uses functions of the mapper, placer and router.
If the configuration cannot be implemented in the XPP, the
algorithm returns to level (1), once more relaxing the size
constraint. The size constraint is relaxed by reducing the
alpha parameter in each backward iteration (see Fig. 9).
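
The three-level iteration can be sketched in C. The sketch is
a toy model under loud assumptions: alpha is kept as an
integer percentage, a level (1) partition is assumed to fill
its size bound completely, and the level (2) and level (3)
sizes are modelled as fixed percentages of the level (1)
estimate. The Model struct and find_alpha are hypothetical
names; the real checks are performed by MODGen and by the
mapper, placer and router.

```c
#include <assert.h>

typedef struct {
    int capacity;     /* resource units available on the device */
    int l2_percent;   /* level (2) size as % of the level (1) estimate */
    int l3_percent;   /* size after real mapping, same convention */
} Model;

/* Returns the first alpha (in percent, starting at 100) for which
 * all three levels succeed, or 0 if no alpha works. */
int find_alpha(const Model *m, int step_percent)
{
    assert(step_percent > 0);
    for (int alpha = 100; alpha > 0; alpha -= step_percent) {
        /* Level (1): partition with the bound capacity * alpha. */
        int l1_size = m->capacity * alpha / 100;

        /* Level (2): re-estimate from the generated NML graph. */
        if (l1_size * m->l2_percent / 100 > m->capacity)
            continue;             /* back to level (1), smaller alpha */

        /* Level (3): attempt the real mapping. */
        if (l1_size * m->l3_percent / 100 > m->capacity)
            continue;             /* back to level (1), smaller alpha */

        return alpha;
    }
    return 0;
}
```

The cheap estimate is consulted on every iteration, while the
more expensive checks are reached only when the previous
level has passed, which reflects the increasing computational
effort from the first to the third level.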

After exposing the configurations with TempPart, the
compiler introduces the statements needed to communicate
scalar variables between partitions (see Fig. 10). Arrays are
used as inter-partition storage for scalar variables too, since
only RAMs (to which the arrays are mapped) keep their data
during reconfiguration. TempPart also ensures that arrays
used by more than one configuration, or by the same
configuration loaded more than once onto the XPP, are bound to
the same memory location and such location is not used by
other arrays during the lifetime of the array variable. The
assignment of all arrays (the ones initially used in the source
code plus the ones added to communicate data) to the
internal memories is done based on the lifetimes of the arrays
determined by the sequence of configurations that were
previously exposed in the input program. This permits, in some
cases, using fewer internal memories since they can be
time-shared among different configurations.

                        int comm[1];

                        ...

a = b * c;              a = b * c;          #1

XPP_next_conf();        comm[0] = a;        #1

d = a/e;                a = comm[0];        #2

                        d = a/e;            #2

a)                      b)                  c)

Figure 10: Example illustrating the communication of the
value of a scalar variable between two configurations. a)
source code; b) code with statements inserted to buffer the
data; c) configuration ID for each of the statements in b).
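
The buffering of Fig. 10 b) can be written as a small C
emulation. The split into two functions, config1 and config2,
is a hypothetical stand-in for the two configurations; the
file-scope array models an internal RAM that keeps its
contents across reconfiguration, while locals model PAE
registers that do not.

```c
#include <assert.h>

/* Models an internal RAM: storage that survives reconfiguration. */
static int comm[1];

/* Configuration #1: a lives in a register that is lost when the
 * configuration is released, so it is buffered in comm[0]. */
void config1(int b, int c)
{
    int a = b * c;
    comm[0] = a;
}

/* Configuration #2: reload a from the buffer, then continue. */
int config2(int e)
{
    assert(e != 0);
    int a = comm[0];
    return a / e;       /* d = a / e */
}
```

Calling config1 followed by config2 yields the same value of d
as the unpartitioned code in Fig. 10 a).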



4.4 Generating the NML Application

Each partition is input to MODGen, which generates the
NML structure to be mapped to the XPP. MODGen
generates, for each exit point existent in each partition, an event
connected to one of the CM ports available in the XPP
(the CM can check if an event is generated and can
proceed with different configurations based on the value of the
event). The compiler generates both the NML
representation of each partition and the NML section specifying the
control flow of configurations. Such control flow is
orchestrated by the CM of the XPP during runtime, as has been
already explained.

The compiler also generates NML code considering the
pre-fetch (load of a configuration to the cache of the XPP)
of configurations. The compiler can furnish two different
strategies: (1) request of the pre-fetch of all configurations
existent in the application at the start of the execution; (2)
request in each configuration of the pre-fetch of the next.
The request is done before the start of the configuration
step for the current configuration. Strategy (1) is used most
of the time. However, there are cases where using (2) is
better. In the presence of several nested if-then-else
structures with different configurations for each branch, a
pre-fetch sequence defined at compile time can introduce
too much overhead.



5 Experimental Results

Tab. 1 shows some results obtained when compiling a set
of benchmarks with the XPP-VC. Note that none of the
examples shown was specially coded to exploit more
efficiently the architectural features of the XPP (e.g.,
partitioning and distribution of arrays among the internal
memories) and thus the results can be further improved. An XPP
core with a single PAC was used. The 2nd column
represents the size of the PAC (number of columns and rows of
PAEs) used for each example. Columns #cf, #PAE, #Lat,
and #max represent the number of configurations, number
of PAEs used (the maximum number of PAEs of the largest
configuration and the total number of PAEs virtually needed
are shown), overall latency (taking into account
setup, fetching, configuration, data communication and
execution), and the maximum number of objects executing per
cycle respectively. The last column shows the CPU time
(using a Pentium III @ 933 MHz with Linux) to compile
each example (from the source program to the generation
of the binary configuration file).

DCT1 is an 8x8 discrete cosine transform
implementation which is based on two matrix multiplications. The
algorithm uses 6 loops for the multiplications and 2 loops to
stream I/O data. It is purely sequential (no unrolling is
used). Temporal partitioning improves the overall latency
of DCT1 by 13% and uses 31 PAEs (without partitioning 51
PAEs are used). Thus it can use a smaller XPP core. DCT2
uses the DCT kernel of DCT1 and traverses an input
image of a pre-defined size (16x16 is used). It uses 2 external
memories to load/store the image and 2 internal RAMs for
intermediate results and to store the coefficients. The
version with 6 configurations was obtained by performing
temporal partitioning. Since the example has two outer loops the
scheme for partitioning loops was applied (the compiler uses
one configuration boundary between the two main loops of
the DCT kernel). With this scheme, a gain of 4% in
performance was achieved using 30% fewer PAEs. Chen is a
pointer-free version, with 180 lines of C, of a DCT
implementation used in JPEG. Temporal partitioning furnished
an improved version: 66% in performance using 11% fewer
PAEs. The computation and data communication is
performed in 688 clock cycles. smooth represents an image
filter (16x16 image). The two inner loops (3x3 window) were
annotated to be unrolled, which led to an efficient
vectorization. An overall speedup of 4 (8.6 considering only
execution time) over the implementation obtained without
unrolling is obtained. Additionally, 2 fewer PAEs are used
with unrolling. Haar is an implementation of the forward
2D Haar wavelet transform. An input image of 16x16 is
used. A performance gain of 36% is achieved when
temporal partitioning is applied. FIR is a 1D FIR filter with 12
taps filtering 2048 samples. Even with all the overheads,
0.42 samples/cycle is computed (0.87 considering only the
latency to communicate data to external memories and the
FIR computation).

Each one of the examples was compiled in less than 5
seconds. This reveals that it is possible to have runtimes
comparable to the ones achieved by software compilation.
Performance gains obtained with temporal partitioning are
shown. Since we ran most examples with small data and
image sequences, the configuration overhead is significant.

Note that the current methodology uses neither
the full potentialities of the XPP nor some optimizations:
(1) The execution of a partition only starts after the full
configuration of its resources; (2) No pipelining between fetch
and configuration for the same partition has been used; (3)
The capacity of the XPP to configure concurrently distinct
PACs was not used; (4) An arbitrary order for pre-fetching
of configurations conditionally requested is used (the order
should be based on the most frequently taken path, e.g.,
determined by profiling); (5) The configuration FIFOs in each
array object were not used. Hence, the performance results
can be further improved.



6 Related Work 

The XPP technology offers a promising reconfigurable
computing platform. Being a step forward in the context of
reconfigurable computing, it permits attacking some of the
well-known deficiencies of related technologies. The
following sub-sections illustrate the most closely related work
and reveal the most important differences.

6.1 High-Level Compilation

The work on compiling high-level descriptions onto
reconfigurable logic has been the focus of many researchers
since the first simple attempts [7]. Most of this work targets
FPGA devices and thus needs logic synthesis, even when
module generators are included in the compilation flow, as
is the case with the MARGE [8] compiler. In addition, such
approaches also need backend mapping, place and route,
which are very time consuming with FPGA technology.
Even when pre-placed and pre-routed components are used
to assist the compilation flow, the compilation time is still
in the order of minutes or hours.

New approaches have been used, which target research
architectures. One of those approaches is the Garp-C
compiler [9]. Although it is used for a reconfigware/software
architecture, the configuration bit stream generation, based
on exploitation of instruction-level parallelism beyond
basic blocks and assisted with fast mapping and placement
tasks, permits targeting fine-grain reconfigurable
architectures efficiently with short compilation times.

Like Garp-C and MARGE, XPP-VC also uses the SUIF
compiler front-end. The generation of the hardware
structure to be mapped to the XPP is assisted with the pipeline
vectorization ideas presented in [6]. However, the
generation of the control structure, based on the event packets
of the XPP, is completely new. Since the XPP is a
coarse-grained architecture, which directly supports arithmetic and
other operations occurring in high-level languages, there is
no need for complex synthesis and mapping. The control
structure is also directly mapped to objects handling events.

6.2 High-Level Temporal Partitioning

Temporal partitioning at the behavioral level has been
already successfully conducted for FPGAs and other types of
RPUs. The majority of the current approaches try to use a
minimum number of configurations by using all the
possible RPU size available for each temporal partition (see, for
instance, [10]). Such schemes only consider another
partition after the current one has filled the available resources
and are insensitive to the optimization that must be applied
to reduce the overall execution time by overlapping the
fetching, configuration and execution steps. Albeit not
considering such optimizations, ILP formulations presented by some
authors [11] are incapable of dealing with the complexity of
many realistic applications.

One of the first attempts to reduce the configuration overhead
in the context of temporal partitioning has been presented
in [12]. However, the approach uses the simple
model of splitting the available FPGA resources into two
parts and performing temporal partitioning using half of the
total available area as the size constraint. The scheme only
overlaps configuration with execution of adjoining partitions
and does not take into account the pre-fetch steps that
can be efficiently used in some RPU architectures. Furthermore,
the approach causes problems when some resources
of the RPU must be shared by two or more partitions. This
contradicts the requirement of disjoint spaces of the RPU
used by two adjacent temporal partitions.
The temporal partitioning algorithm used in the XPP-VC
compiler is based on some ideas presented in [13]. The special
characteristics of the algorithm to deal with resource
sharing during the creation of the partitions are not used
and special heuristics have been added to deal with the fetch
and configuration time of each partition. The proposed
partitioning of loops was first introduced in this paper. The
scheme can deal with any type of loop. The previous
approaches consider loop distribution when a loop does not
fit onto the RPU [14]. However, loops which cannot be
entirely mapped onto a single configuration and which cannot
be distributed are not compiled. Our method can deal with
programs of unlimited complexity as long as the supported
C subset is used. It does not depend on the feasibility
of a specific compiler transformation.



7 Conclusions and Future Work

This paper describes the new Vectorizing C Compiler,
XPP-VC, which maps programs in a C subset extended by
port access functions to PACT's XPP architecture. Assisted
with a fast place and route tool, it furnishes a complete
"push-button" path from algorithmic descriptions onto XPP
configuration data with short compilation times.

An innovative temporal partitioning scheme is presented.
It enables the mapping of complex programs and
furnishes XPP applications with performance gains by hiding
some of the configuration time. A new mechanism to
handle partitioning of loops, which supports loop execution
by the configuration manager of the XPP, is also presented.
Furthermore, the compiler generates self-contained
configuration data even when several configurations are exposed.
Ongoing work focuses on tuning the estimation steps to
assist automatic temporal partitioning and on improving the
configuration data generated.

In addition to loop unrolling, loop merging, loop distribution
and loop tiling will be used to improve loop handling,
i.e., enable more parallelism or better XPP usage. A
future extension of the compiler for a host-XPP hybrid system
is planned. The compiler will map suitable program
parts, especially inner loops, to the XPP, and the rest of the
program to the host processor.



(2) can have more impact if the map, place, and route phase could, as much as
possible, confine temporally adjacent configurations to distinct locations of
the XPP (augmenting the concurrency between execution and configuration).

For reconfigurable computing platforms, where the reconfiguration of the
array takes several clock cycles, it is in most cases preferable that a
configuration is reused as many times as possible in order to reduce the
reconfiguration overhead. Thus, loops in the source code are always good
candidates to be entirely implemented in a single configuration.

2 Specification of Configurations 

Configurations can be specified by the programmer using XPP_next_conf()
statements in the source code of a given application. Such statements must
expose, on the control flow graph (CFG) of the procedure, regions of code with
all entries to the same instruction and possibly multiple exits. The compiler
exposes the configurations, removes such statements from the SUIF1 [2]
intermediate representation, checks for invalid specifications of configuration
boundaries (when the statements expose regions with entries to different
statements in a region of code, or when code can be contained in more than one
region¹), inserts the code responsible for the data communication between
temporal partitions, and generates both the NML (Native Mapping Language) [8]
representation of each configuration and the application section specifying the
control flow of configurations. Such control flow is orchestrated by the CM
(Configuration Manager) of the XPP during runtime.

Consider a pointer-free version of the quantizer of an H.263 implementation
[3] whose code is shown in Fig. 1. Four XPP_next_conf() statements were
inserted in the code to specify three configurations. The configurations
specified are represented in the CFG of the example that can be seen in Fig. 2.
Apart from specifying temporal partitions in such a way that the mapping to the
XPP is accomplished, there can be cases where merging only the most frequently
taken conditional paths into the same configuration reduces the total execution
time by substantially reducing the reconfiguration time (since the partitions
for the other paths are not configured when they are not taken). Fig. 2
presents such a case. If the path bb_0, bb_1 and bb_2 was identified as the most
frequently executed, such path can be specified to be in the same
configuration². In such a case, the configurations related to bb_3 and bb_4
will only be called when the most frequent path has not been taken. In some
examples, paths are only executed in "debug mode" (as is the case of the branch
taken when QP evaluates to false in the source code of Fig. 1).

After exposing the configurations, the temporal partitioning phase
introduces the statements needed to communicate scalar variables between two
different configurations (see Fig. 3). Currently, the scalar variables are stored in

¹Tail duplication could be applied in some examples.

²Tail duplication of bb_5 would permit a configuration with {bb_0, bb_1, bb_2,
bb_5}; another one with {bb_4, bb_5}; and another one with {bb_3, bb_5}.

Joao M P Cardoso, PACT Informationstechnologie GmbH,



if (QP) {
  if (Mode == MODE_INTRA || Mode == MODE_INTRA_Q) { /* Intra */
    qcoeff[0] = mmax(1, mmin(254, coeff[0]/8));
    for (i = 1; i < N; i++) {
      level = (abs(coeff[i])) / (2*QP);
    }
  } else { /* non Intra */
    XPP_next_conf();
    for (i = 0; i < N; i++) {
      level = (abs(coeff[i]) - QP/2) / (2*QP);
      qcoeff[i] = mmin(127, mmax(-127, sign(coeff[i]) * level));
    }
  }
} else {
  XPP_next_conf();
  for (i = 0; i < N; i++) {
    qcoeff[i] = coeff[i];
  }
}
XPP_next_conf();



Figure 1: C source code of the quantization algorithm with configuration boundaries.




Figure 2: CFG of the algorithm shown in Fig. 1. The lines crossing edges
represent the XPP_next_conf() statements in the code. The bubbles containing
basic blocks of the CFG represent the exposed regions of the CFG that are
implemented in different temporal partitions.





Figure 3: Example illustrating the communication of the value of a scalar
variable between two configurations. a) source code; b) source code with
statements inserted to buffer the data; c) configuration ID for each of the
statements in b).

arrays specially inserted in the SUIF1 representation of the given application.

Those arrays are mapped to internal memories of the XPP³. The temporal
partitioning phase also ensures that arrays used by more than one configuration,
or by the same configuration loaded more than once to the XPP, are bound to
the same memory location and that such location is not used by other arrays
during the lifetime of the array variable⁴.
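
The buffering illustrated in Fig. 3 can be sketched in plain C as a host-side emulation. The array name comm follows the figure; the function names conf1 and conf2, the expression a = b + c, and the use of a static array to stand in for an XPP internal memory are assumptions made only for this illustration.

```c
#include <assert.h>

/* Stand-in for an XPP internal memory whose contents survive
   reconfiguration (assumption: modeled as a static host array). */
static int comm[1];

/* First temporal partition: computes a and buffers it. */
void conf1(int b, int c) {
    int a = b + c;
    comm[0] = a;      /* statement inserted by the partitioning phase */
}

/* Second temporal partition: restores a before using it. */
int conf2(int e) {
    assert(e != 0);   /* guard the division in this sketch */
    int a = comm[0];  /* statement inserted by the partitioning phase */
    return a / e;
}
```

Calling conf1 and then conf2 yields the same value of d as the unpartitioned code, because the only data crossing the boundary is the buffered scalar.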

The assignment of all existing arrays (the ones initially used in the source
code plus the ones added to communicate data) to the internal memories is done
based on the lifetimes of the arrays determined by the sequence of configurations
that were previously exposed in the application. This permits, in some cases,
a reduction of the number of internal memories needed, by time-sharing some
internal memories among different configurations during the execution of the
application on the XPP.
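
One way to picture lifetime-based sharing is the following minimal sketch. It is not the compiler's actual algorithm, which the text does not detail: it assumes each array's lifetime is an interval of configuration indices in execution order, and uses a simple greedy interval assignment. The type lifetime_t and the function assign_memories are hypothetical names.

```c
#include <assert.h>

#define MAX_MEMS 16

/* Lifetime of an array: index of the first and last configuration
   (in execution order) that uses it. */
typedef struct { int first, last; } lifetime_t;

/* Greedy assignment. Arrays must be sorted by 'first'.
   mem_of[i] receives the memory index chosen for array i.
   Returns the number of memories used, or -1 if more than
   n_mems would be required. */
int assign_memories(const lifetime_t *arr, int n, int n_mems, int *mem_of) {
    int free_after[MAX_MEMS]; /* last configuration using each memory */
    int used = 0;
    assert(n_mems <= MAX_MEMS);
    for (int i = 0; i < n; i++) {
        int m = -1;
        /* Reuse a memory whose previous array is already dead. */
        for (int k = 0; k < used; k++) {
            if (free_after[k] < arr[i].first) { m = k; break; }
        }
        if (m < 0) {
            if (used == n_mems) return -1; /* not enough memories */
            m = used++;
        }
        free_after[m] = arr[i].last;
        mem_of[i] = m;
    }
    return used;
}
```

The strict comparison keeps two arrays that are live in the same configuration in distinct memories, which matches the constraint that each array of a configuration is aligned to its own memory.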

The XPP-VC compiler generates, for each exit point existing in each
configuration, an event connected to one of the CM ports available in the XPP (the
CM can check if an event is generated and can proceed with different
configurations based on the value of the event). The generated event has value "0" if
the path that activates that exit is taken and "1" otherwise.

3 Mapping Applications with Configuration Boundaries in Loop Bodies

Configuration boundaries in loop bodies can be dealt with by performing loop
distribution (as long as it can be applied) or by temporally partitioning the loop and

³In this case, the internal memories are used as data buffers for the maintenance of the
original program behavior.

⁴At the moment each configuration must use a number of array variables, to be assigned
to the internal memories of the XPP, fewer than or equal to the number of internal
memories of the XPP (the compiler aligns each array to a distinct memory). However, the
total number of arrays existing over all configurations can surpass the number of internal
memories, if some memories can be shared between configurations due to the non-overlap
of the lifetimes of array variables. The data stored in memories is maintained across
reconfigurations.



Figure 4: Example of the transformation applied to loops with configuration
boundaries in their bodies. a) original source code; b) transformed code; c)
configuration ID for each statement in b).

using the CM to orchestrate the control flow.

Currently, loop distribution [11] is not automatically applied⁵. All the loops
with configuration boundaries specified in their bodies are transformed into
if() goto label kind of loops in order to permit the NML generation by the
XPP-VC compiler. Fig. 4 shows an example of such a transformation without the
statements needed to communicate the value of scalar variables between
configurations. The 3rd column shows the configuration ID of each statement. Each
configuration requests the next configuration to be taken (if the exit taken is to
the end then only the "reconf"⁶ of the configuration is done). Conf #2 needs a
conditional request mechanism to call conf #3 or conf #4 based on the value of
the i<N expression. Since conf #3 always requests, at the end of its
execution, conf #2, the initial behavior of the loop is maintained. The temporal
partitioning task also creates two more configuration boundaries to preserve the
initial functionality. From Fig. 4b) it can be seen that configuration boundaries
were inserted before and after the if statement. Such boundaries are needed
since the code before and after will be executed once, while the if header and
body will iterate N+1 and N times respectively.
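
The control flow described above can be emulated on a host as a small state machine, where each case plays the role of one configuration and the variable next plays the role of the request sent to the CM. This is only a sketch: the enumerator names and the function run_partitioned_loop are hypothetical, and the numbering (1 = initialization, 2 = loop header, 3 = body, 4 = code after the loop) follows the description in the text rather than a verified reading of Fig. 4.

```c
#include <assert.h>

enum { CONF_INIT = 1, CONF_TEST, CONF_BODY, CONF_AFTER, CONF_DONE };

/* Emulates the CM: each "configuration" runs, then requests the next.
   Returns how often the loop header (conf #2) was executed;
   *body_runs receives the number of body executions (conf #3). */
int run_partitioned_loop(int n, int *body_runs) {
    int i = 0, header_runs = 0, next = CONF_INIT;
    assert(n >= 0);
    *body_runs = 0;
    while (next != CONF_DONE) {
        switch (next) {
        case CONF_INIT:               /* code before the loop: runs once */
            i = 0;
            next = CONF_TEST;
            break;
        case CONF_TEST:               /* lab1: if (i < N) */
            header_runs++;
            next = (i < n) ? CONF_BODY : CONF_AFTER;
            break;
        case CONF_BODY:               /* loop body, then goto lab1 */
            (*body_runs)++;
            i++;
            next = CONF_TEST;
            break;
        case CONF_AFTER:              /* code after the loop: runs once */
            next = CONF_DONE;
            break;
        }
    }
    return header_runs;
}
```

For a loop of N iterations the header runs N+1 times and the body N times, which is why the header and body need to be separated from the code that executes only once.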

The configuration boundaries inserted in loop bodies must specify, at the
scope of the loop body, the permitted type of regions (already explained).

Loop distribution (also known as loop fission) will be the preferable form to
implement loops whose generated NML does not entirely fit in the available
resources of the XPP. Such a transformation can potentially lead to the introduction
of temporary arrays. Consider the loop shown in Fig. 5 where a configuration
boundary is specified. The loop can be split so that each of the two statements
is in its own loop and the configuration boundary is now outside any loop
body. However, we need to scalar-expand variable s in order to maintain the

⁵The compiler should check if the loop distribution can be applied on each temporal
partition boundary existing in loop bodies.

⁶reconf means that the resources used by that configuration are released and then can
be reconfigured.
4 EXECUTION STRATEGIES



for (i = 0; i < N; i++) {
    statement 1
    XPP_next_conf();
    statement 2
}

Figure 5: Applying loop distribution as another way to enable temporal
partitioning on loop bodies.

initial functionality (one array with the size of the number of iterations of the
loop must be declared and is used to communicate each s value in each of the
loop iterations).
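
A minimal sketch of loop distribution with scalar expansion follows. The statement bodies (in[i] * 2 and s + 1), the bound N = 8, and the names fused, distributed and s_exp are assumptions chosen for illustration, since the statements of Fig. 5 are not reproduced here.

```c
#include <assert.h>

#define N 8

/* Original form: s is produced and consumed in the same iteration,
   with a configuration boundary between the two statements. */
void fused(const int *in, int *out) {
    assert(in && out);
    for (int i = 0; i < N; i++) {
        int s = in[i] * 2;      /* statement 1 */
        /* XPP_next_conf(); would sit here, inside the loop body */
        out[i] = s + 1;         /* statement 2 */
    }
}

/* After loop distribution: s is expanded into s_exp[N], one element
   per iteration, and the boundary lies between the two loops. */
void distributed(const int *in, int *out) {
    int s_exp[N];
    assert(in && out);
    for (int i = 0; i < N; i++)
        s_exp[i] = in[i] * 2;   /* statement 1 */
    /* XPP_next_conf(); now sits outside any loop body */
    for (int i = 0; i < N; i++)
        out[i] = s_exp[i] + 1;  /* statement 2 */
}
```

Both functions compute the same output; the price of moving the boundary out of the loop is the temporary array, which on the XPP would occupy an internal memory.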



To reduce the overall latency, efficient exploitation of the pipelining of the 3 steps
present in each temporal partition (fetching, configuring, and array execution)
must be conducted.

Two "reconf" modes can be used (the user can select one of the modes in the
options of the XPP-VC compiler related to temporal partitioning):
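
The benefit of pipelining these steps can be estimated with a simple cost model. The sketch below is an assumption-laden illustration, not the compiler's estimator: it considers only one kind of overlap (the fetch of partition i+1 proceeding while partition i is configured and executed), and the type tp_cost_t and both function names are hypothetical.

```c
#include <assert.h>

/* Cost of one temporal partition, in clock cycles. */
typedef struct { int fetch, config, exec; } tp_cost_t;

/* No overlap: every step of every partition runs back to back. */
long latency_sequential(const tp_cost_t *tp, int n) {
    long t = 0;
    for (int i = 0; i < n; i++)
        t += tp[i].fetch + tp[i].config + tp[i].exec;
    return t;
}

/* Pre-fetch overlap: the fetch of partition i+1 runs while partition i
   is configured and executed; partition i+1 can only be configured
   once both activities have finished. */
long latency_prefetch(const tp_cost_t *tp, int n) {
    assert(n > 0);
    long t = tp[0].fetch;       /* the first fetch cannot be hidden */
    for (int i = 0; i < n; i++) {
        long busy = tp[i].config + tp[i].exec;
        long next_fetch = (i + 1 < n) ? tp[i + 1].fetch : 0;
        t += (busy > next_fetch) ? busy : next_fetch;
    }
    return t;
}
```

With such a model, a fetch is fully hidden only when the preceding partition stays busy at least as long as the fetch takes; otherwise the fetch time dominates that stage.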

• "reconf" executed by the CM. In this case each configuration communicates
with the CM by sending an event, upon completion of execution,
to request the next configuration. This next configuration starts by
executing a "reconf" command to an XPP resource of the configuration (that
command is broadcast throughout all the resources used by the
configuration, and so the resources will be released and can be reconfigured by
the next configuration). When a configuration can be requested by more
than one previous configuration, special configurations are inserted in the
Temporal Partition Control Flow Graph (TPCFG⁷) between each source
and the sink. Such special configurations only command the "reconf" of
a resource in the XPP of the previous configuration and request the next
one. This type of "reconf" does not permit overlapping of
execution and configuration between temporal partitions;

• "reconf" self applied by each configuration. In this case each configuration
at the end of the execution broadcasts a "reconf" event to all the XPP

⁷The TPCFG is a directed, possibly cyclic, graph where each node represents a
configuration (temporal partition) and edges between two nodes specify the execution flow of
the application through its temporal partitions. There is only one edge between two nodes of
the graph and each node represents a region of the CFG of the application.



resources pertaining to it. This mode does not need the addition of special
configurations by the compiler and permits the CM to try to configure
the next configuration called during the execution of the caller temporal
partition (when only one configuration path is present).

The compiler also generates NML code considering the pre-fetch (load of a
configuration to the cache of the XPP) of configurations. When the pre-fetch is
enabled two strategies can also be automatically used:

• request of the pre-fetch of all configurations existing in the application at
the start of the execution (during this pre-fetching the flow of configuration
and execution is done in the way it is specified in the application section
of the NML file);

• request in each configuration of the pre-fetch of the next. The request is
done before the start of the configuration step for the current
configuration.

The CM of the XPP also permits speculative configuration of a temporal
partition, which can lead to better performance results even when the Map, Place
and Route does not try to locate temporal partitions in non-overlapping areas
of the XPP. The strategy tries to configure the partition speculatively used
after the configuration of the current one. If the path which includes that
configuration is taken, the CM only has to enable the start of the execution of the
configuration (see the section of the NML code in Fig. 6 and the simulation
results in Fig. 7, where conf_MOD2 is speculatively configured during the
execution of conf_MOD0). When such path is not taken, the CM releases the
resources already configured and requests the other configuration.

5 Automatic Temporal Partitioning 

Automatic temporal partitioning permits the automatic exposing of
configurations oriented by two distinct goals:

• minimum number of configurations: this goal can be achieved with
algorithms that try to use all the available reconfigurable processing units
during the assignment of segments of behavioral code to the same
configuration;

• minimum overall latency: this goal can be achieved by considering the costs
to load into the cache, to configure and to execute each configuration with
the XPP array. An important strategy that must be considered is the use
of pre-fetch of configurations while one of the others is running. Arrays
of constants or with pre-defined values used in one or more configurations
can be initialized in one of the previous configurations if such one exists.
This takes advantage of the initialization of the array carried out by using
the configuration bus.



CONFIG conf_MOD0 {
  CONF_MODULE(MOD0)          // request the configuration of MOD0
  SET(MOD0.Reconf.E = 1)     // enable the self releasing of resources
  REQUEST(conf_MOD2_spec)    // speculative configuration
  // If (MOD0.CMPort0 == "0") then conf_MOD2_exec is requested
  // else continue
  // If (MOD0.CMPort1 == "0") then conf_MOD1 is requested
  CONF_CMPORT(MOD0.CMPort0, conf_MOD2_exec, 0)  // take MOD2 or continue
  CONF_CMPORT(MOD0.CMPort1, conf_MOD1, 0)       // take MOD1
}

// request the configuration of MOD2

CONFIG conf_MOD2_exec {      // MOD2 is taken
  SET(MOD2.Start.A = 1)      // enable the start of computing of MOD2
  SET(MOD2.Reconf.E = 1)     // enable the self release of resources
  REQUEST(conf_MOD3)         // request the next configuration
}

CONFIG conf_MOD1 {           // MOD1 is taken
  REQUEST(conf_MOD2_rec)     // request the releasing of resources
  CONF_MODULE(MOD1)          // request the MOD1
  SET(MOD1.Reconf.E = 1)     // enable the self releasing of resources
  REQUEST(conf_MOD3)         // request the next configuration
}

CONFIG conf_MOD2_rec {
  RECONF(MOD1.Start)         // release the resources of MOD1
}



Figure 6: Example of a section of NML code describing the speculative
configuration concept.
Figure 7: Example of the overlapping among the fetch, configuration, and
execution steps of different temporal partitions.



From the SUIF1 representation of the C source code the temporal partitioning
phase constructs an extended HTG (Hierarchical Task Graph)⁸. Such extended
graph has 2 types of nodes:

1. behavioral nodes representing source lines of code in the input program;

2. array nodes representing each array existing in the source code.

Type (1) nodes have 3 distinct sub-types:

1. block nodes representing basic blocks with one entry and a single exit;

2. compound nodes representing if-then-else structures;

3. loop nodes representing the loops (for, while, etc.). Loop and compound
nodes explicitly embody hierarchical levels.

Edges in the HTG+ represent data communication between two nodes or just
enforce execution precedence.

Each behavioral node of the HTG+ is labeled with the following information
(some of the labeling steps require estimation efforts):

• block and compound nodes: number of ALUs and REGs;



⁸The model has been chosen because it will also permit exploiting loop and task level
parallelism.



Figure 8: Top level of the HTG+ for the DCT example (this top level consists
of 4 loops). Circles and boxes represent behavioral and array nodes respectively.
Loop 1 reads the data-stream from an input port to an internal memory and
Loop 4 writes the data-stream generated by the DCT code from
an internal memory to an output port of the XPP.



• loop nodes: number of iterations (unknown if unbound), and number of
ALUs and REGs;

• array nodes: the size of the array, type of the elements, and, when they
do exist, the initialization values.

Each edge between two behavioral nodes of the HTG+ is labeled with the
number of data words that must be transmitted between the two nodes.
Each edge between an array node and a behavioral node in the HTG+ is labeled
with the number of load and store references (... = A[i] and A[i] = ...
respectively) in the source code represented by the behavioral node to that particular
array. The estimated number of times that each load and store reference will
be executed is also collected. Such information is used to calculate the penalty
when two or more behavioral nodes are merged into the same temporal
partition. Such penalty is related to the use of the same array by different behavioral
nodes and adds an overhead to the execution latency of that temporal partition
and to the number of resources needed for its implementation.
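
The labels described above suggest a data structure along the following lines. This is a hedged sketch: all type, field and function names are hypothetical, and since the text does not give the penalty formula, the function only counts the load and store references that the nodes of one candidate partition make to a given array, which is the raw quantity such a penalty would be derived from.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NODE_BLOCK, NODE_COMPOUND, NODE_LOOP, NODE_ARRAY } node_kind_t;

typedef struct htg_node {
    node_kind_t kind;
    int alus, regs;      /* behavioral nodes: estimated resources */
    long iterations;     /* loop nodes: iteration count, -1 if unknown */
    size_t array_size;   /* array nodes: number of elements */
} htg_node_t;

typedef struct {
    const htg_node_t *src, *dst;
    int words;           /* behavioral-behavioral: data words transferred */
    int loads, stores;   /* array-behavioral: load/store references */
} htg_edge_t;

/* Counts the load and store references to 'array' made by the
   behavioral nodes listed in part[] (one candidate partition). */
int array_refs_in_partition(const htg_edge_t *edges, int n_edges,
                            const htg_node_t *array,
                            const htg_node_t **part, int n_part) {
    int refs = 0;
    assert(array->kind == NODE_ARRAY);
    for (int e = 0; e < n_edges; e++) {
        const htg_node_t *other = NULL;
        if (edges[e].src == array) other = edges[e].dst;
        else if (edges[e].dst == array) other = edges[e].src;
        if (!other) continue;               /* edge does not touch the array */
        for (int p = 0; p < n_part; p++) {
            if (part[p] == other) {
                refs += edges[e].loads + edges[e].stores;
                break;
            }
        }
    }
    return refs;
}
```

A count above one for the same array signals that merging the nodes forces them to share one memory, which is the situation the penalty is meant to capture.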

Fig. 8 shows the top level of the HTG+ for an implementation of the DCT
(Discrete Cosine Transform) based on matrix multiplications.

The automatic temporal partitioning phase needs 3 types of estimations:

Figure 9: Automatic Temporal Partitioning methodology.

• number of XPP resource units needed by the configuration implementing
a single or a set of behavior nodes;

• latency for a behavior node or a set of connected behavior nodes on the
XPP (this does not need to be accurate to the real execution time and
only needs to have relative accuracy);

• number of clock cycles to fetch and configure each temporal partition
(calculated based on the number of configuration words needed⁹).

The temporal partitioning strategy does not exploit configuration boundaries
inside loop bodies, unless the entire loop cannot be mapped to the XPP. The
generation of this type of temporal partitions never produces better results

⁹Can be estimated by the number of edges, ALU nodes, REG nodes, and pre-defined values
existing in the hardware graph generated by the XPP-VC compiler.



(at least when the loop behavior is ensured by the CM). The justification is
supported by the reutilization of resources, already configured, achieved when
the entire loop is implemented by a single configuration. When a loop does not
fit in the XPP, the algorithm is applied hierarchically to the body of the loop.

Fig. 9 shows the methodology. The strategy works around 3 levels (the
computational efforts increase from the first to the third level):

1. Temporal Partitioning algorithm based on the estimation of the needed
resources done with function costs based on the number and kind of
operations in the source code. The algorithm uses the HTG+ and the SUIF
representation of the program;

2. For each configuration selected in the first level, the estimated sizes are
checked against the ones estimated by generating the NML graph with the
XPP-VC compiler. If the size surpasses the available resources, the
algorithm reruns level one, tightening the size constraint (diminishing the
maximum number of available resources);

3. Check if each configuration successfully checked in level 2 can really be
mapped to the XPP. This level uses functions of the mapper, placer and
router. If the configuration cannot be implemented in the XPP, the
algorithm returns to level 1, once more tightening the size constraint.
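
The interplay of the three levels can be sketched as an iterative loop that lowers the size limit until both checks pass. The callback type check_fn, the function refine_size_limit, the fixed decrement step, and the two toy checkers are all assumptions for illustration; the text does not state how the limit is reduced or how the real estimators are invoked.

```c
#include <assert.h>

/* Hypothetical check: returns nonzero if the partitioning produced
   under size limit 'limit' passes this level. */
typedef int (*check_fn)(int limit, void *ctx);

/* Level 1 partitions under 'limit'; levels 2 and 3 validate the result.
   On failure the limit is reduced by 'step' and level 1 is rerun.
   Returns the accepted limit, or -1 if none is found. */
int refine_size_limit(int available, int step,
                      check_fn level2_nml_size_ok,
                      check_fn level3_mappable, void *ctx) {
    assert(step > 0);
    for (int limit = available; limit > 0; limit -= step) {
        /* level 1 would run the partitioner with 'limit' here */
        if (!level2_nml_size_ok(limit, ctx)) continue; /* NML estimate too big */
        if (!level3_mappable(limit, ctx)) continue;    /* map/place/route fails */
        return limit;
    }
    return -1;
}

/* Toy checkers: ctx points to two thresholds, one per level. */
static int toy_nml_fits(int limit, void *ctx) { return limit <= ((int *)ctx)[0]; }
static int toy_mappable(int limit, void *ctx) { return limit <= ((int *)ctx)[1]; }
```

Ordering the checks from cheapest to most expensive means the mapper, placer and router are only consulted for candidates that already passed the NML-based estimate.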

The temporal partitioning algorithm used is based on the ideas presented in
[7]. The special characteristics of the algorithm to deal with resource sharing
during the creation of the temporal partitions have been removed and special
heuristics have been added to deal with the fetch and configuration time of each
temporal partition. The algorithm tries to overlap configuration and execution
with fetch during the selection of the HTG+ nodes for each temporal partition.

6 Discussion 

We call the attention of the reader to the fact that the current methodology
uses neither the full potential of the XPP nor some optimizations:

1. The execution of a given temporal partition only starts after all the used
resources have been configured;

2. No pipelining between fetch and configuration has been used. The
configuration of the XPP resources for a specific temporal partition only starts
after its configuration words are fetched (loaded to the XPP cache);

3. No overlapping on execution between two or more configurations;

4. The capacity of the XPP technology to configure concurrently distinct
PACs (each PAC has its own

5. An arbitrary order for fetching of temporal partitions conditionally
requested is used (the fetching order should be given by the most taken path
determined by profiling);

6. Behavioral nodes exposed on the HTG+ as concurrent nodes are not at the
moment implemented, by the XPP-VC compiler, with parallel execution.

Thus, we strongly believe that there is still potential to improve the performance
results achieved when using XPP-VC.

7 Related Work 


The XPP technology offers a unique reconfigurable computing platform
supported by tools that permit compiling algorithms in C. Being a step forward
in the context of reconfigurable computing, it permits attacking some of the
well-known deficiencies present in many, if not all, other reconfigurable computing
technologies. However, some of the work being done to augment the potential
of such technology has its sources in some works previously done.

Temporal partitioning has already been successfully conducted for FPGAs
and other types of RPUs. The majority of the current approaches try to use a
minimum number of configurations by using all the RPU area available
for each temporal partition (see, for instance, [4]). Such schemes only consider
another temporal partition after the current one has filled the available
resources and are insensitive to the optimizations that must be applied to reduce
the overall execution time by overlapping the fetching, configuration and execution
steps. Although they do not consider such optimizations, ILP formulations presented
by some authors are unable to deal with the complexity of many realistic
examples.

One of the first attempts to reduce the configuration overhead in the context
of temporal partitioning has been presented in [6]. However, the approach uses
the simple model of splitting the available FPGA resources into two parts and
performing temporal partitioning using half of the total available area as the size
constraint. The scheme only overlaps configuration with execution of adjoining
partitions and does not take into account the pre-fetch steps that can be
efficiently used in some RPU architectures. Furthermore, the approach can
cause problems when some resources of the RPU must be shared by
two or more partitions (eliminating the requirement of disjoint spaces of the RPU
used by two adjacent temporal partitions).

[12] presents the scheduling of kernels (sub-tasks) targeting the MorphoSys
architecture. They use an efficient search pruning scheme added to a heuristic
that permits considering first the solutions which potentially lead to the
best performance results. However, they mainly orient the search to data re-use
among the scheduled kernels, which is only suitable for the type of reconfigurable
computing architectures where no memories local to the RPU are available. The
scheduler tries to overlap computing and data transfers and to minimize context
reloading, which, as we can see from the examples shown, cannot always lead
to the overall minimum latency. The scheme needs as input the application
flow-graph (without concurrency and conditional paths) and the kernel timing. The
approach does not consider temporal partitioning and so requires that each kernel
configuration does not exceed the context memory size.
Temporal Partitioning for the XPP-VC
Compiler

i&th January 
1 Benefits of Temporal Partitioning

Temporal partitioning can have a distinct and important goal other than simply
enabling the compilation of algorithms whose mapping onto the RPU
(Reconfigurable Processing Unit) resources cannot be accomplished by only one
configuration. For instance, temporal partitioning targeting the XPP [1] can
reduce, when efficiently applied, the overall execution latency. Such reduction
can be mainly enabled by the following issues:

• reduction of the interconnection lengths, by reducing each design
complexity, can furnish better performance results (long interconnections pass
through registers, thus adding clock cycle delays);

• reduction of each temporal partition complexity can itself reduce the
number of registers used for vertical routing;

• reduction of the number of references, in each temporal partition, using
the same resource, by distributing the overall references among temporal
partitions, can furnish better performance results as well. This happens
with the statements present in the program referring to the same array;

• reduction of the overall configuration overhead by overlapping fetching,
configuration and execution of distinct temporal partitions.

The reduction of the configuration overhead is due to 3 distinct sources of
overlapping, possible with the XPP architecture:

1. loading of subsequent configurations into the cache in parallel with the
configuration of the current one;

2. execution of one configuration while the next one is being configured;

3. execution of one configuration while the next one is being loaded into the
cache.
Abstract

Resource virtualization on FPGA devices, achievable
due to their dynamic reconfiguration capabilities, provides
an attractive solution to save silicon area. Architectural
synthesis for dynamically reconfigurable FPGA-based
digital systems needs to consider the case of reducing the
number of temporal partitions (reconfigurations), by
enabling sharing of some functional units in the same
temporal partition. This paper proposes a novel algorithm
for automated datapath design, from behavioral input
descriptions (represented by a dataflow graph), which
simultaneously performs temporal partitioning and
sharing of functional units. The proposed algorithm
attempts to minimize both the number of temporal
partitions and the execution latency of the generated
solution. Temporal partitioning, resource sharing,
scheduling and a simple form of allocation and binding
are all integrated in a single task. The algorithm is based
on heuristics and on a new concept of construction by
gradually enlarging timing slots. Results show the
efficiency and effectiveness of the algorithm when
compared to existent approaches.

1 Introduction 

The availability of multi-programmable logic devices
(such is the case of FPGAs - field programmable gate
arrays) with lower reconfiguration times has made possible
the concept of "virtual hardware" [1][2]: the hardware
resources are supposed unlimited and implementations that
oversize the resources available on the device are resolved
by temporal partitioning. Then, the temporal partitioned
solution is executed by time-sharing the device such that
the initial functionality is preserved. This concept promises
to be an efficient solution to save silicon area [1]. One
of the applications is the switch among functionalities that
have mutual exclusiveness on the temporal domain, such
as the context-switching between coding/decoding
schemes in communication, video or audio systems.



Although even the latest commercial FPGAs, such as
the Xilinx Virtex family [3], do not have mechanisms to
implement efficiently temporal partitioned functionalities
and the time of reconfiguration of the overall FPGA is still
quite high, the importance of the "virtual hardware" concept
has already been demonstrated with computationally
complex applications [4]. Industrial efforts are under way
to further improve the capability of the devices to handle
multiple configurations by storing several on-chip
configurations and permitting the switch between contexts in
a few nanoseconds [5].

The virtualization of FPGA resources has been considered
by several authors while dealing with circuit netlists
that oversize the available resources on the device ([6][7],
just to name a few). From the point of view of the design,
those approaches work at a much lower level of abstraction,
without the possibility to exploit tradeoffs between the
number of reconfigurations and the resource sharing of
functional units (FUs), for instance. The design automation
for FPGA-based systems should include temporal
partitioning algorithms able to efficiently exploit the new
concept. Tradeoffs among parallelism, communication costs,
execution and reconfiguration times, and sharing of some
FUs in the same reconfiguration need to be considered
during the architectural synthesis phases.

Sharing of FUs among operations is a technique to reuse
a single configuration of an FU by more than one
operation of the same type. On the other hand, temporal
partitioning is a technique tailored to reuse the available
resources by different circuits (configurations) with the
time-multiplex of the device. The nodes of a given intermediate
representation (e.g., a dataflow graph) representing
operations have to be scheduled in time steps to be executed in
each temporal partition (TP). Temporal partitioning must
preserve the dependencies among nodes (that are already
temporal dependencies) such that a node B dependent on
node A cannot be mapped to a partition executed before
the partition where node A is mapped. In addition,
considering sharing FUs during temporal partitioning can
lead to better overall results (lower number of TPs and
better performance).
Figure 1a) shows a design flow which integrates temporal
partitioning prior to the high-level synthesis tasks [8].
The majority, if not all, of the existent approaches utilize
the presented flow [9][10]. Our efforts address architectural
synthesis* integrating temporal partitioning, and this
paper presents a new temporal partitioning algorithm that
effectively takes into account sharing of FUs, while
maintaining a small computational complexity. Besides, it is
sufficiently flexible to target different FPGA devices.
Figure 1b) shows the design flow proposed in this paper,
where temporal partitioning is integrated in the high-level
synthesis tasks and is performed simultaneously.
Figure 1. Design flow based on high-level
synthesis for reconfigurable computing
systems: a) traditional flow; b) proposed
flow.

Example 1. Motivational example.

Consider the dataflow graph exhibited in Figure 2 (Ex1). It
consists of 4 additions and 2 multiplications. Suppose that
each adder uses 1 cell and has a latency of 1 clock cycle,
each multiplier uses 2 cells and has a latency of 2 clock
cycles, and the maximum resources available on the device
equals 3 cells. The dataflow graph has a critical path
latency of 4 cycles and needs 8 cells given those FUs (last
row of Table I). Figure 2 shows an optimal solution (not
considering the area of multiplexers, registers and control
unit needed to implement sharing of a specific FU) for the
example, with results shown in the second row of Table I.
In Figure 2 each gray region identifies operations that are
mapped to the same FU. The optimal solution is achieved
with only one adder and one multiplier and fits totally on a
single TP. When not considering sharing of adders, the
optimum result is shown in the third row of Table I. The
algorithm proposed in this paper achieves those optimal
results. The fourth row of the table shows the solution
obtained when considering a leveling temporal partitioning
algorithm that does not consider resource sharing of FUs.
From this example, it can be seen that resource sharing can
reduce the number of reconfigurations and can also reduce
the overall execution latency. There are also cases where
the critical path latency of the input dataflow graph (last
row) is maintained (second row).

* [...] distinction among the terms: high-level synthesis, arch[...]
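The cell counts quoted in Example 1 can be checked with a short Python sketch. The dictionary and function names are illustrative only, and the fully shared case ignores the multiplexer, register and control overhead, exactly as the example itself does.

```python
# Figures stated in Example 1; the names are illustrative.
CELLS = {"add": 1, "mul": 2}   # cells used by one FU of each type
OPS = {"add": 4, "mul": 2}     # operations in the dataflow graph
R_MAX = 3                      # cells available on the device


def area_without_sharing(ops, cells):
    """One FU instance per operation."""
    return sum(count * cells[kind] for kind, count in ops.items())


def area_with_full_sharing(ops, cells):
    """One FU instance per operation type (sharing overhead ignored)."""
    return sum(cells[kind] for kind, count in ops.items() if count > 0)


unshared = area_without_sharing(OPS, CELLS)   # 4*1 + 2*2 cells
shared = area_with_full_sharing(OPS, CELLS)   # 1 + 2 cells
fits_single_tp = shared <= R_MAX
```

Without sharing the graph needs more cells than the device offers, which forces several TPs; with one adder and one multiplier it fits in a single TP.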




Figure 2. Dataflow graph of the example 



Table I. Results for Ex1.
This paper is organized as follows.
Section 2 formulates and explains the problem. The algorithm
is explained in depth in section 3, where the pseudo-code
and the overall performed steps are fully elucidated
through an example. In section 4 experimental results are
shown and discussed. In section 5, related work is
described. Finally, in section 6, conclusions are presented
and further work is envisaged.

2 Problem Definition 

Given a dataflow graph (DFG), representing a behavioral
description, $G = (V, E)$, topologically ordered, directed
and acyclic, with $|V|$ nodes $\{v_1, v_2, \ldots, v_{|V|}\}$ and $|E|$
edges, where each node $v_i$ represents an operation and
each edge $e_{ij} \in E$ represents a dependence between nodes
$v_i$ and $v_j$. A dependence can be a simple precedence-
dependence or a transport-dependence due to the transport
of data between two nodes. The DFG can be obtained from
an algorithmic input description. Such a pre-processing step
is beyond the scope of this article, but the front-end of our
Java compiler for reconfigurable computing systems can
be applied [12].

Here we assume that there is a component library with a
set of FUs and there is one FU for each type of operation
in the DFG. $\Phi$ represents the set of FUs, from the component
library, to be instantiated by the algorithm. $R_{MAX}$
represents the resource capacity available on the device.
$R(\pi_i)$ returns the number of resources utilized by the TP $\pi_i$
and $R(v_i)$ returns the number of resources utilized by the
FU instance associated with $v_i$. $N(\pi_i)$ returns the subset of
nodes of $V$ mapped to $\pi_i$.

Each partition $\pi_i$ is a non-empty subset of $V$, where for
each node there exists a map to one and only one FU instance in
$\Phi$. $\pi(v_i)$ identifies the TP where node $v_i$ is mapped. The set
of the TPs is represented by:

$$\wp = \{\pi_1, \pi_2, \ldots, \pi_N\} \qquad (3)$$

where $N$ represents the number of TPs. A graph $G$, temporally
partitioned in $N$ subsets (TPs), is correct if:

- $\forall i \neq j,\ \pi_i \cap \pi_j = \emptyset$: each node $v_i \in V$ is mapped to only
one TP (here we do not consider cloning of operations
in the DFG);

- $\bigcup_{i=1}^{N} \pi_i = V$: all the nodes of $V$ are mapped;

- $\forall \pi_i \in \wp,\ R(\pi_i) \le R_{MAX}$: each TP fits in the resources
available on the device;

- $\forall e_{ij} \in E,\ \pi(v_i) \le \pi(v_j)$: the order of the execution of
the TPs does not violate the dependencies among
operations of the DFG (necessary condition to obtain the
same functionality).

A correct set of TPs guarantees the same overall behavior
as the original graph (when executed from 1 to $N$ and
considering a correct communication mechanism to transfer
data among TPs). However, we are also interested in
the minimization of the overall execution latency. The cost
that reflects the overall execution latency in a time-
multiplexed device can be estimated by equation (1) or
(2), when partial or full reconfiguration of the available
resources is considered respectively. $CS(\wp)$ returns the
minimum execution latency (number of control steps or
clock cycles) of the partitioned solution. $CS(\pi_i)$ refers to
the minimum execution latency of the TP $\pi_i$ (it may include
the communication costs and represents the execution
latency of the critical path of the graph formed by the
subset of nodes in $\pi_i$ and the correspondent edges,
considering that nodes sharing FU instances can exist). $d_i$ and $d$
represent the number of clock cycles to reconfigure the TP
$\pi_i$ or all the available resources respectively.

$$CS(\wp) = \sum_{i=1}^{N} \left( CS(\pi_i) + d_i \right) \qquad (1)$$

$$CS(\wp) = \sum_{i=1}^{N} CS(\pi_i) + N \times d \qquad (2)$$
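The correctness conditions and the two cost estimates can be sketched in Python as follows. This is a sketch under stated assumptions: the function names are hypothetical, the resource usage of each TP is supplied by the caller (so that any sharing is already accounted for), and the cost functions assume that equation (1) adds a per-TP reconfiguration time d_i to each TP latency while equation (2) adds the full-device time d once per TP.

```python
def is_correct(partitions, all_nodes, edges, tp_resources, r_max):
    """Check the four conditions for an ordered list of TPs (sets of nodes)."""
    seen = set()
    for tp in partitions:
        if not tp or seen & tp:       # non-empty and pairwise disjoint
            return False
        seen |= tp
    if seen != all_nodes:             # every node is mapped
        return False
    if any(r > r_max for r in tp_resources):   # each TP fits on the device
        return False
    where = {v: i for i, tp in enumerate(partitions) for v in tp}
    # A node may not run in an earlier TP than a node it depends on.
    return all(where[a] <= where[b] for a, b in edges)


def cost_partial(cs, d_each):
    """Assumed form of equation (1): partial reconfiguration, d_i per TP."""
    return sum(c + d for c, d in zip(cs, d_each))


def cost_full(cs, d):
    """Assumed form of equation (2): full reconfiguration, d for every TP."""
    return sum(cs) + len(cs) * d


# Hypothetical four-node graph split into two TPs.
nodes = {"a1", "a3", "m1", "a2"}
edges = [("a1", "m1"), ("m1", "a2"), ("a3", "a2")]
good = [{"a1", "a3"}, {"m1", "a2"}]
bad = [{"m1", "a2"}, {"a1", "a3"}]    # m1 would run before its producer a1
```

Passing the per-TP resource figures in explicitly keeps the check independent of how FU instances are shared inside a TP.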



The objective of our algorithm is to furnish a set of
datapaths that will be executed in sequence with a minimum
number of control steps*. Each datapath unit fits on
the physically available resources. For the sake of
minimizing the number of TPs needed, exploiting sharing of
FUs while doing temporal partitioning needs to be considered
by the algorithm. Specifically, our algorithm has to
output:

- The set of TPs ($\wp$): each TP identifying the nodes of
the DFG assigned to it;

- The set of instances for each FU used ($\Phi$);

- Each node of the DFG has to identify a specific FU
instance of $\Phi$ implementing the operation.

From those outputs, it is straightforward to generate a
behavioral HDL-RTL (hardware description language at
the register transfer level) description of each TP control
unit and a structural HDL-RTL description of each
datapath, considering the existence of an HDL description
of each FU. The configurations can be generated from
those netlists using a traditional FPGA design flow.

3 Algorithm Simultaneously Exploiting
Temporal Partitioning and Sharing of FUs

The algorithm uses an initial number of TPs that can be
specified by the user. Another possibility is to use the
number of levels of the DFG or the number of TPs utilized
by any temporal partitioning algorithm without using sharing
of FUs (e.g., ASAP [11]) as the initial number of TPs.
The user has to specify the total number of available
resources on the device. In addition, for each FU there exists
a boolean variable whose value indicates if the FU can be
shared or not (sharing of some FUs may need more
resources than the utilization of several FU instances, due to
the overhead of using auxiliary circuits needed for the
implementation of the sharing mechanism).

For a clear description, we show the main steps of the
algorithm with a connection to Example 1. A brief exposition
of the steps performed, when considering sharing of
all FUs, is sketched in Figure 3.

* We assume that each control/time step for scheduling is equal to the
clock period of the system. Thus, there is no distinction among the use of
clock cycle, control step or time step.



Figure 3. Algorithm execution through an
example: a) ASAP and ALAP start times; b)
the nodes in the critical path identified by
the gray region; c), d), e), f) and g) show
iterations of the algorithm.



The algorithm starts with the following steps:

1. Compute the set of nodes child* of each node of the
DFG;

2. Assign an FU instance to each operation in the DFG (at
the moment we neither consider more than one FU for the
same operation nor FUs capable of implementing more
than one operation);

3. Estimate the area and execution latency of each node in
the DFG according to the FU characterization, existent
in the component library, for the target device. This
step is beyond the scope of this article and from now
on we will assume that there exists, for each FU, an
estimation of the number of resources and of the
execution latency;

4. Compute the ASAP (as soon as possible) and ALAP (as
late as possible) start times for each node in the DFG
(see Figure 3a)), both unconstrained. When doing the
ALAP scheme, the algorithm also calculates the ALAP
level of each node;

5. Determine the set of nodes in one of the critical paths
of the DFG (see Figure 3b));

6. Create a number of TPs equal to the input number
specified (see the three TPs initially created in Figure
3c));

7. Assign each node of the set of nodes in one of the
critical paths of the DFG (determined in point 5) to a TP by
ascending level. When the number of TPs is larger than
the number of nodes in the critical path, the last TPs
are left empty; otherwise the last nodes of the set are
left unassigned (see the nodes assigned to each TP in
Figure 3c));

8. Assign the size (number of resources used) of a node in
a TP to the current size of that TP (see Figure 3c)).

* A node v_j is child of a node v_i if there exists a path from v_i to the end
of the DFG that includes v_j.
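Steps 4 and 5 can be illustrated with a minimal Python sketch. The four-node graph and its node names are hypothetical (the graph of Figure 2 is not reproduced here), the latencies follow Example 1, and the function returns every node with zero mobility, which is the union of all critical paths, whereas the algorithm described above picks the nodes of one critical path.

```python
def asap_alap(nodes, edges, latency):
    """Unconstrained ASAP and ALAP start times.

    nodes must be listed in topological order; edges are (producer, consumer).
    """
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for a, b in edges:
        preds[b].append(a)
        succs[a].append(b)

    asap = {}
    for v in nodes:                       # forward pass
        asap[v] = max((asap[p] + latency[p] for p in preds[v]), default=0)
    total = max(asap[v] + latency[v] for v in nodes)

    alap = {}
    for v in reversed(nodes):             # backward pass
        alap[v] = min((alap[s] for s in succs[v]), default=total) - latency[v]

    critical = [v for v in nodes if asap[v] == alap[v]]   # zero mobility
    return asap, alap, critical, total


# Hypothetical graph: two adders feed a multiplier chain ending in an adder.
nodes = ["a1", "a3", "m1", "a2"]
edges = [("a1", "m1"), ("m1", "a2"), ("a3", "a2")]
latency = {"a1": 1, "a3": 1, "m1": 2, "a2": 1}   # adder 1 cycle, multiplier 2
asap, alap, critical, total = asap_alap(nodes, edges, latency)
```

Here `a3` can start anywhere between cycle 0 and cycle 2, so it is the only node outside the critical path.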

After the above steps the main kernel of the algorithm
is executed (see the pseudo-code in Figure 4, Figure 5 and
Figure 6). Some of the most important functions used by
the algorithm are listed and briefly explained below:

- v_i.ALAPlevel(): returns the level of v_i considering an
ALAP leveling scheme;

- v_i.ALAPStart(): returns the ALAP start time of v_i;

- π_i.addEl(v_i): adds the node v_i to π_i;

- π_i.rmEl(v_i): removes v_i from π_i;

- π_i.sched(v_i): returns the number of control steps of the
critical path considering that v_i is mapped to π_i;

- ℘.add(π_i): adds a new TP π_i to the current set of TPs
(π_i will be the last TP of the set);

- ℘.elAt(i): returns the i-th TP from the set of TPs (℘);

- findNodes(i): returns a list of nodes ready to be
mapped to the i-th TP.

Our algorithm progressively constructs a
global solution. On each iteration, the algorithm traverses
the sequence of the existent TPs trying to assign ready
nodes to each TP. Each TP has an associated maximum
slot time (MAX_CS). A node ready to be mapped to a TP is
only really considered for mapping if the resultant execution
latency of that TP (considering the mapping) does not
exceed the correspondent MAX_CS (line 15 of Figure 4 and
lines 2, 21 and 29 of Figure 5). MAX_CS of a given TP π_i is
equal to the critical path latency of that TP increased by a
relax amount: CS(π_i) + relax. On each iteration over the TPs
the relax value is incremented by the greatest common
divisor (gcd) among all the execution latencies of the
operations in the DFG (line 24 of Figure 4). When a node is
mapped (see function mapNode in Figure 6), the critical
path length of the associated TP is updated (lines 4 and 5
of Figure 6).
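The slot test and the relax increment can be sketched in a few lines of Python. The function names are hypothetical; only the gcd step and the comparison against CS(π_i) + relax are taken from the text.

```python
from functools import reduce
from math import gcd


def relax_step(latencies):
    """Increment applied to relax after each pass over the TPs:
    the gcd of all operation latencies in the DFG."""
    return reduce(gcd, latencies)


def within_slot(cs_new, cs_tp, relax):
    """True if mapping the node keeps the TP inside its slot MAX_CS,
    where MAX_CS = CS(tp) + relax."""
    return cs_new <= cs_tp + relax


# Latencies of Example 1: four adders (1 cycle) and two multipliers (2 cycles).
step = relax_step([1, 1, 1, 1, 2, 2])

# A node that would stretch a 3-cycle TP to 4 cycles is rejected while
# relax is 0 and accepted after one increment.
rejected_first = not within_slot(4, 3, 0)
accepted_later = within_slot(4, 3, 0 + step)
```

Using the gcd as the increment means the slot grows by the finest granularity at which any schedule length can change.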

The algorithm considers that nodes in contiguous time
steps mapped to the same TP and with the same operation
should be bound to the same FU instance.

A list of nodes ready for mapping to a current TP is
used. The list has the nodes sorted by increasing ALAP
start times (the candidate operation having the least ALAP
value will have the highest priority) and, for nodes with
the same ALAP start time, it uses the ASAP start time as a
tiebreak (by ascending or descending order). The list is
determined examining for a given node its predecessors (they
already must be mapped in TPs before the current TP) and
the child set (the nodes child of the node to be mapped
must be on TPs after the TP under consideration). The
incremental update of the list of the nodes candidate to be
mapped to the current TP, when each node is mapped, is
an option of the algorithm (lines 6 and 7 in Figure 6).
When such option is disabled the algorithm only tries to do
the update when the list is empty. The algorithm uses a
static-based approach in the sense that the ALAP/ASAP values
are calculated only once and they are not updated
afterwards.
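The ordering of the ready list can be written as a two-key sort. This Python sketch uses hypothetical node names and start times; the direction of the ASAP tiebreak is left as a parameter because the text allows either order.

```python
def sort_ready(ready, alap, asap, asap_descending=False):
    """Sort ready nodes by increasing ALAP start time; ties are broken by
    the ASAP start time in ascending or descending order."""
    sign = -1 if asap_descending else 1
    return sorted(ready, key=lambda v: (alap[v], sign * asap[v]))


# Hypothetical ready nodes with their statically computed start times.
alap = {"x": 2, "y": 1, "z": 2}
asap = {"x": 1, "y": 0, "z": 0}

ascending = sort_ready(["x", "y", "z"], alap, asap)
descending = sort_ready(["x", "y", "z"], alap, asap, asap_descending=True)
```

Node `y` always comes first because it has the smallest ALAP value; only the relative order of `x` and `z` depends on the tiebreak direction.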

1.  // begin main kernel
2.  BitSet NodesSched = marked with the nodes already mapped to TPs;
3.  int NumTP = 0, relax = 0;
4.  int step = gcd(All nodes in G);
5.  while(!AllNodesSched(NodesSched)) {
6.    LOOP B: while(NumTP < maxPartitions) {
7.      Vector ListReady = findNodes(NumTP);
8.      TP π_i = ℘.elAt(NumTP);
9.      while(!ListReady.isEmpty()) {
10.       Node v_k = ListReady.rmFirst();
11.       [...]
12.       Boolean fit = ([...] <= R_MAX);
13.       // CS(π_i) when v_k is mapped to π_i:
14.       CS_new = π_i.sched(v_k);
15.       if((CS_new > (CS(π_i) + relax)) && (π_i is the last TP) [...]
            (R(π_i) [...] 0) || (v_k.ALAPlevel() < SS(v_k))) {
16.         mapNode(π_i, v_k, NodesSched, update, CS_new,
              ListReady); // Figure 6
17.       } else {
18.         tryToSched(℘, v_k, fit, relax, π_i,
              NodesSched, update, CS_new,
              ListReady); // Figure 5
19.       }
20.     }
21.     [...]
22.   }
23.   NumTP = 0;
24.   relax += step;
25. }
26. // end main kernel

Figure 4. Main kernel of the proposed
algorithm.



1.  tryToSched(TPSet ℘, Node v_i, Boolean fit, int
      relax, TP π_k, BitSet NodesSched, Boolean
      update, int CS_new, Vector ListReady) {
2.    [...] {
3.      if(there is only one node v_j in π_k) {
4.        if(v_j.ALAPStart() > v_i.ALAPStart()) {
5.          if(([...] - R([...])) <= R_MAX) {
6.            [...]
7.            R(π_k) = R(v_i);
8.            π_k.addEl(v_i);
9.            NodesSched.clear();
10.           NodesSched.set(v_i);
11.           continue LOOP B;
12.         }
13.       }
14.     Boolean canShare = try sharing with a
          node of the same type with a path of
          shared FUs with the smallest length
          [...] of nodes;
15.     if(canShare) {
16.       if(fit && share produces increase) {
17.         canShare = false;
18.         rmShare(v_i);
19.       } else {
20.         int CS_new = π_k.sched(v_i);
21.         if(CS_new <= (relax + CS(π_k))) {
22.           mapNode(π_k, v_i, NodesSched, update,
                CS_new, ListReady); // Figure 6
23.         } else {
24.           rmShare(v_i);
25.           canShare = false;
26.         }
27.       }
28.     }
29.     if(!canShare && fit && (CS_new <= (relax + [...]
30.       mapNode(π_k, v_i, NodesSched, update,
            CS_new, ListReady); // Figure 6
31.     }
32.     if(v_i not mapped and no FU with operation
          type of v_i in this TP and does not
          fit and this TP is the last TP) {
33.       create a new TP π_k;
34.       ℘.add(π_k);
35.       mapNode(π_k, v_i, NodesSched, update,
            CS(π_k), ListReady); // Figure 6
36.       break LOOP B;
37.     }
38.   }
39. } // end tryToSched

Figure 5. Function tryToSched.

1. mapNode(TP π_k, Node v_i, BitSet NodesSched,
     Boolean update, int CS_new, Vector ListReady) {
2.   π_k.addEl(v_i);
3.   NodesSched.set(v_i);
4.   if(CS_new > CS(π_k))
5.     CS(π_k) = CS_new;
6.   if(update)
7.     updateAndSortALAP(ListReady, v_i);
8. } // end mapNode

Figure 6. Function mapNode.
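The bookkeeping performed by mapNode can be mirrored in a small Python sketch. The `TP` class, the argument order and the optional `resort` callback are assumptions made for illustration and are not the data structures of the original implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TP:
    """Hypothetical temporal partition: mapped nodes and critical path length."""
    nodes: list = field(default_factory=list)
    cs: int = 0


def map_node(tp, node, scheduled, cs_new, ready, update=False, resort=None):
    """Map node to tp, mark it as scheduled and stretch the TP latency if needed."""
    tp.nodes.append(node)          # corresponds to addEl
    scheduled.add(node)            # corresponds to NodesSched.set
    if cs_new > tp.cs:             # the critical path of the TP only grows
        tp.cs = cs_new
    if update and resort is not None:
        resort(ready, node)        # optional incremental update of the ready list


tp = TP(cs=2)
scheduled = set()
map_node(tp, "a", scheduled, cs_new=3, ready=[])
map_node(tp, "b", scheduled, cs_new=1, ready=[])
```

After both calls the TP holds two nodes and keeps the larger of the two candidate latencies.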
4.1 Sharing versus not sharing

Table III shows results for the considered examples.
Our* and Our** identify results obtained by applying the
proposed algorithm. Our* considers resource sharing for
both adder and multiplier units, and Our** only considers
resource sharing for multiplier units. #cs identifies the
execution latency (number of clock cycles) and #℘ the
number of TPs. Each solution related to our algorithm was
obtained in less than 1 s of CPU time.

Table III. Results obtained for the examples.

Example   Rmax   ASAP        SA          Our*        Our**
                 #℘   #cs    #℘   #cs    #℘   #cs    #℘   #cs

6 2 1

SEHWA     6 18 36 18 33 1 34 34
          10 9 24 9 19 1 IS ?8
          IS 18 15 15 15

HAL       11 5 10 JO 10 3 3

EWF       12 26 23 23 10 22 6 22 18 18 is 19 18 17 18

FIR       14 14 27 ?o 27 10 7 15 15 IS IS 1 s 12 S 11 1 11 11

MAT4X4    6      72   13$    72   134    1    130    21   130
          10     37   69     37   69     1    66     17   66
          15     25   47     25   46     1    46     10   46
          20     16   29     16   29     2    29     4    29



The SA results were obtained with a simulated annealing
version to do temporal partitioning without resource
sharing proposed in [16]. Here, the algorithm is tuned to
optimize the overall execution time (the algorithm can
also exploit the tradeoff between execution time and
communication costs). The ASAP results refer to the leveling
technique proposed in [11].

Only Mat4x4 needed to start with the number of TPs
obtained by the ASAP approach to achieve the best solution.
For all the other examples, the best solution was obtained
starting with an initial number of TPs equal to the
number of levels of the DFG. The results for Mat4x4 in
Table III were collected disabling the update of the list of
nodes ready for each node mapped (the list is updated only
when it is empty). It is strongly recommended to disable
the update option for examples with a high degree of
parallelism and a small critical path length.

The values in bold in the 6th and 8th columns of Table
III show the minimum execution latency for the datapaths
obtained by the considered approaches (not considering
configuration times). The values in bold in the 10th column
show that, even without considering sharing of adders,
our algorithm returns solutions with execution latencies
equal to the execution latencies obtained sharing all the
resources (8th column), despite the fact that those solutions
need more TPs.

When considering resource sharing for all FUs, a
minimum number of TPs (only 4 cases of Table III needed
more than one TP to produce a minimum execution time)
seems to ensure solutions with lower execution latencies
than those obtained by doing temporal partitioning with
ASAP or SA for the majority of the examples (only one
case is not as good as SA). Note that when all the FUs can
be shared and the resource overhead to implement sharing
is not taken into account, an empirical observation tells us
that the solutions with lower execution latency are those
with only one TP. This is explained by the fact that a new
TP produces an equal or worse effect than sharing FU
instances on the overall execution latencies because all the
nodes in that TP can only start executing after the end of
the execution of the TP immediately before.

When sharing of adders is not considered the algorithm
is capable of finding solutions without inferior execution
latency.

4.2 Exploiting the number of TPs

An exploitation of the overall execution latency versus
the number of TPs is shown in Figure 8. Those results
were produced by calling the algorithm several times, each
time starting with a different initial number of TPs from a
range of 1 to 15. The exploitation has been done in
approximately 5.4 s of CPU time. All the solutions use only a
single TP and the best result (execution latency equal to 66
clock cycles) has been achieved when the algorithm
started with 8 TPs. The results without considering sharing
of adders are shown in Figure 9. The algorithm exploited a
range of TPs from 1 to 26 and the minimum execution
latency achieved was 66 clock cycles (solution with 21 TPs).
Based on those results we can select a solution that
minimizes the global execution latency taking into account the
reconfiguration times (see equation (2)).
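Selecting among the explored solutions can be sketched in Python as follows. The sketch assumes that equation (2) charges a full-device reconfiguration time d once per TP; the function names and the value of d are hypothetical, while the two candidates (1 TP and 21 TPs, both with 66 clock cycles) are the ones quoted above.

```python
def total_latency(cs, n_tps, d):
    """Execution latency plus one full reconfiguration of d cycles per TP
    (assumed form of equation (2))."""
    return cs + n_tps * d


def best_solution(candidates, d):
    """candidates is a list of (number of TPs, execution latency) pairs."""
    return min(candidates, key=lambda c: total_latency(c[1], c[0], d))


# (TPs, clock cycles): with and without sharing of adders, as quoted above.
candidates = [(1, 66), (21, 66)]
D = 100                      # hypothetical reconfiguration time in cycles
choice = best_solution(candidates, D)
choice_cost = total_latency(choice[1], choice[0], D)
```

Both solutions execute in the same number of cycles, so any non-zero reconfiguration time favors the single-TP solution.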




Figure 8. Execution latency versus the initial
number of TPs for Mult4x4 obtained by
the proposed algorithm, when Rmax=10
(sharing of adders and multipliers).
From the results presented so far we may conclude that
sharing FUs can reduce the number of TPs, without
increasing the overall execution time. Moreover, a minimum
number of TPs can be a priority, when an FPGA with
significant reconfiguration times is used. Due to its low
computational complexity, the algorithm can be used to exploit
the design space based on the tradeoff between the number
of TPs and the overall execution latency.



Figure 9. Execution latency and final
number of TPs versus the initial number of
TPs obtained by the algorithm for Mult4x4,
when Rmax=10 (no sharing of adders).



4.3 Comparison with other schedulers

At this point a question may occur: is the algorithm
competitive when a single TP is envisaged? Table IV
shows results for EWF and SEHWA, considering various
sizes for the available resources (Rmax). The schedules
obtained by the proposed algorithm considering only one TP
are shown (see the 5th column). The number of resources
used for each type of FU for each solution is also shown
(last column). "Fixed" refers to results collected from the
state-of-the-art schedulers [17][18][19] and represents
optimal (marked in the table) or near-optimal scheduling
results (without taking temporal partitioning into account)
for the specified constraint on the number of FUs for
each type of operation (see the 2nd column). The results
show that our algorithm is efficient, even when we are
interested in a final solution with a single TP.

The specially labeled result is achieved without an
incremental update of the list of the nodes ready to be
mapped. This result shows that the algorithm did not escape
from a local minimum, since at least the result related to
Rmax=15 should be achieved. The first 4 results obtained
for SEHWA consider the increasing order of the ASAP
values as the second key (there is no evidence to suggest
when it is better to use the decreasing or the increasing
ASAP values as the second key).

The number of each FU instance allocated by our
algorithm for each Rmax constraint only was different in two

[...]

to the constraints used (with total number of resources
equal to Rmax) produce the near-optimal scheduling
results (see Table IV). Therefore, it seems that our
algorithm can also be used for a fast identification of the
number of FU instances needed, considering a specific
number of maximum resources available on the device.

Table IV. Comparison of scheduling results
obtained for EWF and SEHWA.
5 Related Work

As far as we know, the development of temporal
partitioning algorithms was firstly considered in [9][2]. The
similarities of both scheduling on high-level synthesis [8]
and temporal partitioning allow the use of common
scheduling schemes for partitioning. Some authors, such as
[9][10], have considered temporal partitioning at
behavioral levels having in mind the integration of synthesis.

In [9], a heuristic based on a static list scheduling
algorithm, enhanced to consider temporal partitioning and
partial reconfiguration, is shown. The approach exploits the
dynamic reconfiguration capability of the devices, while
doing temporal partitioning.

In [10][20] the temporal partitioning problem is modeled
in a specified 0-1 non-linear programming (NLP)
model. The problem is transformed to integer linear
programming (ILP) and the solution determined by an ILP
solver. Due to the long execution times, this approach is
not practical for large input examples. Some heuristic
methods have been developed to permit its usability on
larger input examples [21]. Kaul [22] exploits the loop
fission technique while doing temporal partitioning in the
presence of loops to minimize the overall latency by
utilization of the active TP as long as possible. Sharing of
functional units is considered inside tasks and temporal
partitioning is performed at the task level. Design space
exploitation is performed by inputting to the temporal
partitioning algorithm different design solutions for each task.
Such solutions are generated by a high-level synthesis tool
(constraining the number of FUs of each type). This
approach lacks a global view and is time-consuming.

The simplest approaches only consider temporal
partitioning without exploiting sharing of FUs. In [11], both a
temporal partitioning algorithm based on leveling the
operations by an ASAP scheme and another based on clustering
a number of nodes are used. The algorithm fills the
available resources in the increasing order of the ASAP levels.
The selection of nodes in the same level is arbitrary and
the algorithm switches to another TP when it encounters
the first node that does not fit on the current TP. The
approach considers neither communication costs nor
resource sharing. In [23] another algorithm is presented
that selects the nodes to be mapped in a TP with two
different approaches (one for satisfying parallelism and
another for decreasing communication costs). In [12], an
algorithm based on the extension of the ASAP or ALAP
leveling schemes resorting to the mobility of each node to
select among the nodes has been considered. [12] also
shows an algorithm that searches recursively in the list of
ready nodes so that if a node cannot be mapped to the
current partition, other nodes can be considered.
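The leveling scheme attributed to [11] can be sketched in Python from the description above. This is an illustration under stated assumptions, not the published algorithm: nodes are taken in the order given inside each ASAP level, there is no resource sharing, and the node names and areas are hypothetical.

```python
def level_partition(nodes_by_level, area, r_max):
    """Fill TPs in increasing ASAP level order; open a new TP at the first
    node that does not fit in the current one (no resource sharing)."""
    partitions = [[]]
    used = 0
    for level in nodes_by_level:
        for v in level:
            if area[v] > r_max:
                raise ValueError(f"node {v} does not fit on the device")
            if used + area[v] > r_max:
                partitions.append([])   # switch to another TP
                used = 0
            partitions[-1].append(v)
            used += area[v]
    return partitions


# Hypothetical graph: adders use 1 cell, the multiplier uses 2, device has 3.
levels = [["a1", "a3"], ["m1"], ["a2"]]
area = {"a1": 1, "a3": 1, "m1": 2, "a2": 1}
tps = level_partition(levels, area, r_max=3)
```

Because every operation gets its own FU, the device fills up after two adders and a second TP is needed, which is the behavior the sharing-aware algorithm aims to avoid.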

[16] considers both communication costs among different
TPs that can occur and the overall execution time. The
authors presented an extension to static list scheduling,
which makes the algorithm sensitive to the communication
costs while trying to minimize the overall execution
time. The results presented, when compared to near-
optimal solutions obtained with a simulated annealing
algorithm tuned to do temporal partitioning while minimizing
an objective function that integrates the execution
time of the TPs and the communication costs, revealed the
efficiency of the approach.

[24] presents a method to do temporal partitioning considering
pipelining of the reconfiguration and execution stages. The
approach divides an FPGA into two portions to overlap the
execution of a TP in one portion (previously reconfigured) with
the reconfiguration of the other portion.

In [25] constraint logic programming is used to solve
temporal partitioning, scheduling, and dynamic module
allocation. However, the approach needs a specification of
the number of each FU before processing and may suffer
from long runtimes.

More related to our approach is the algorithm presented
in [26]. A scheme based on the force-directed list scheduling
algorithm that considers resource sharing and temporal
partitioning is shown. The algorithm tries to minimize the
overall execution time, performing a tradeoff between the
number of TPs and sharing of FUs. However, the approach
adapts a scheduling algorithm not originally tailored to
do temporal partitioning and lacks a global view. Our
approach instead proposes a novel algorithm matched to
the combination of temporal partitioning and sharing of
FUs that maintains a global view.

6 Conclusions and Future Work

In this paper we have presented a new and useful algorithm
combining temporal partitioning, sharing of functional
units, scheduling, allocation and binding. Unlike
other approaches, this algorithm merges those tasks in a
combined and global method. The results obtained from a
number of benchmarks strongly confirm the efficiency
and effectiveness of the idea.

The low computation time achieved when dealing with
the presented examples shows that the algorithm is fast
and efficient and thus can be used on large examples.

The inclusion of functional units with pipeline stages
and the consideration of more than one implementation for
a given operation will be considered in the near future.
Another important issue is the overlapping of reconfiguration
and execution, which should be considered by future
enhancements. Finally, aspects related to conditional paths
and loops will also need to be the focus of future work.
1 Introduction

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. I. e., the program is transformed to one or several configurations of the RDFP.
This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest of
the program to the host processor. However, this extended compiler is not the subject of this document.



2 Compilation Flow

This section briefly describes the phases of the compilation method.

2.1 Frontend

The compiler uses a standard frontend which translates the input program (e. g. a C program) into an
internal format (IF) consisting of an abstract syntax tree (AST) and symbol tables. The frontend also
performs well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. E. g., the
SUIF compiler [2] can be used for this purpose.

2.2 Temporal Partitioning

Next, the program's IF representation is partitioned into sections which are executed sequentially on the
RDFP by separate configurations. If the entire program can be executed by one configuration (fitting on
the given RDFP), no temporal partitioning is necessary. This phase generates reconfiguration statements
which load and remove the configurations sequentially according to the original program's control flow.



2.3 Configuration Generation

Finally, the program sections determined by the temporal partitioning are mapped to RDFP
configurations. This phase generates a program code or data structure which is then used to directly
program the RDFP.

3 Configurable Objects and Functionality of a RDFP

This section describes the configurable objects and functionality of a RDFP. A possible implementation
of the RDFP architecture is a PACT XPP Core. Here we only describe the minimum requirements for
a RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2.

2002-1-18 xppvcpat V0.1 Confidential

20-JAN-2003 23:26 PAT.-ANW. P. PIETRUK +49 721 46930B S. 56/73

A Method for Compiling High-Level Language Programs to a Reconfigurable Data-Flow Processor
3.1 Configurable Objects and Functions

An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i. e. not all objects need to be
able to perform all functions. E. g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to
realize certain functions. Several RAM objects can, e. g., be combined to realize a RAM function with
larger storage.

After a configuration has been removed, all information is lost. Only the contents (values) of a RAM are
preserved during reconfiguration.
Figure 1: Functions of an RDFP

The following functions, mainly handling data packets, can be configured in an RDFP. See Fig. 1 for a
graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL.¹ ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 event packets with value 0 (0-events) and one event packet with value
1 (1-event) are generated at output U. At output V, N 0-events are generated, and at output W,
N 0-events and one 1-event are created. The 1-event at W is only created after the counter has
terminated, i. e. a NEXT event packet was received after the last data packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both have arrived, both inputs are consumed. The data packet is copied to output X, and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-packet, input A is copied to output X and input B discarded. For a 1-packet, B is
copied and A discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-packet, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-packet, B is copied and A left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-packet, input A is copied to output X, and no packet is created at output
Y. For a 1-packet, A is copied to Y, and no packet is created at output X.

• MDATA: A MDATA function multiplicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-packet, a data packet at A is consumed and copied
to output X. For all subsequent 0-packets at SEL, a copy of the input data packet is produced at
the output without consuming new packets at A. Only if another 1-packet arrives at SEL, the next
data packet at A is consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)

¹Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.

²Note that this can be implemented by a MERGE with special properties on XPP.
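
The difference between MUX, MERGE, DEMUX and MDATA lies in which input packets are consumed, kept or discarded. The following sketch models each connection as a queue of packets to make that behaviour concrete. It is only an illustration under assumed names: the function signatures and the queue model are not part of the RDFP definition.

```python
from collections import deque


def mux(a, b, sel):
    """MUX: consume one packet from A and B; forward A on a 0-event, B on a 1-event."""
    out = deque()
    while a and b and sel:
        pa, pb, s = a.popleft(), b.popleft(), sel.popleft()
        out.append(pb if s else pa)  # the packet not selected is discarded
    return out


def merge(a, b, sel):
    """MERGE: forward A on a 0-event, B on a 1-event; the other packet stays at its input."""
    out = deque()
    while sel and (b if sel[0] else a):
        src = b if sel.popleft() else a
        out.append(src.popleft())
    return out


def demux(a, sel):
    """DEMUX: route A to output X on a 0-event and to output Y on a 1-event."""
    x, y = deque(), deque()
    while a and sel:
        (y if sel.popleft() else x).append(a.popleft())
    return x, y


def mdata(a, sel):
    """MDATA: a 1-event consumes a new packet from A; each 0-event repeats the held packet."""
    out, held = deque(), None
    while sel:
        if sel[0] == 1:
            if not a:
                break  # stall until a new data packet arrives
            held = a.popleft()
        elif held is None:
            break  # nothing to repeat yet
        sel.popleft()
        out.append(held)
    return out


mux_out = mux(deque([10, 11]), deque([20, 21]), deque([0, 1]))
a_in, b_in = deque([10, 11]), deque([20, 21])
merge_out = merge(a_in, b_in, deque([0, 1]))
x_out, y_out = demux(deque([1, 2, 3]), deque([0, 1, 0]))
mdata_out = mdata(deque([7, 8]), deque([1, 0, 0, 1]))
```

In the MUX case the packets 11 and 20 are lost, whereas after the MERGE call they are still waiting at inputs A and B.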




Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts their value.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3, ..., producing a packet at output U. The output is
a 1-event iff one or more of the input packets are 1-events (logical or). A packet must be available
at all inputs before an output packet is produced.³

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e. g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.
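
The event functions above can be modelled compactly on lists of 0/1 values. This is a minimal sketch with assumed function names. In particular, the `repeats` parameter only stands in for the endless repetition of an ESEQ without a START input.

```python
def filter_events(events, keep):
    """0-FILTER (keep=0) or 1-FILTER (keep=1): pass matching events, discard the rest."""
    return [e for e in events if e == keep]


def inverter(events):
    """INVERTER: one output event per input event, with inverted value."""
    return [1 - e for e in events]


def constant(events, value):
    """0-CONSTANT / 1-CONSTANT: one output event per input event, forced to `value`."""
    return [value for _ in events]


def ecomb(*inputs):
    """ECOMB: waits for a packet on every input, then emits their logical OR."""
    return [int(any(group)) for group in zip(*inputs)]


def eseq(seq, starts=None, repeats=1):
    """ESEQ[seq]: one full sequence per START event, or `repeats` rounds without START."""
    rounds = repeats if starts is None else len(starts)
    return [int(c) for c in seq] * rounds


only_ones = filter_events([0, 1, 0, 1], keep=1)
or_events = ecomb([0, 0, 1], [0, 1, 0])
two_rounds = eseq("0001", starts=[0, 0])
```

Because `zip` stops at the shortest input, `ecomb` produces an output only for those positions where every input has delivered a packet, which mirrors the synchronizing behaviour of ECOMB.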

3.2 Packet-Based Communication Network

The communication network of an RDFP can connect an output of one object (i. e. its respective
function) to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions in a
data-flow fashion. I. e., the function only executes when all input packets are available (apart from the
exceptions where not all inputs are required as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs are connected to
the same function input (N to 1 connection), the self-synchronization is disabled. The user has to ensure
that only one packet arrives at a time. Otherwise a packet might get lost, and the value resulting from
combining two or more packets is undefined. Therefore this should be avoided. However, a function
output can be connected to many function inputs (1 to N connection) without problems.

There are some special cases:

• A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution. It is even possible to connect an output of another function
to a constant input. In this case, the constant value is changed as soon as a new packet arrives at
the input. Note that there is no self-synchronization in this case, either. The function is not stalled
until the new packet arrives since the old packet is still used and reproduced.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback are possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network.

³Note that this function is implemented by the EAND operator on the XPP.
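
The firing rule of this section (execute only when all inputs hold a packet and the previous output has been consumed) together with the preloaded and constant inputs can be sketched as follows. The `Slot` class and `try_fire` function are hypothetical names for this illustration, and every connection is assumed to hold at most one packet.

```python
class Slot:
    """A connection holding at most one packet."""

    def __init__(self, value=None, constant=False):
        self.value = value          # a non-None initial value models a preloaded packet
        self.constant = constant    # constant inputs reproduce their packet on every use

    def full(self):
        return self.value is not None

    def take(self):
        v = self.value
        if not self.constant:
            self.value = None
        return v


def try_fire(op, inputs, output):
    """Fire `op` once if every input holds a packet and the output slot is free."""
    if output.full() or not all(s.full() for s in inputs):
        return False                # the function stalls
    output.value = op(*(s.take() for s in inputs))
    return True


a, b, x = Slot(3), Slot(2, constant=True), Slot()
fired_first = try_fire(lambda p, q: p * q, [a, b], x)   # fires, x now holds 6
fired_again = try_fire(lambda p, q: p * q, [a, b], x)   # stalls: x not consumed, a empty
x.take()                                                # a consumer removes the result
a.value = 5                                             # a new packet arrives at input A
fired_third = try_fire(lambda p, q: p * q, [a, b], x)   # fires again, reusing the constant
```

The constant input `b` never runs empty, so only input `a` and the output slot decide whether the function can fire.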

4 Temporal Partitioning

The details of Temporal Partitioning need to be inserted from J. Cardoso's documents.

5 Configuration Generation

5.1 Language Definition

The following HLL features are not supported by the method described here:

• pointer operations

• library calls, operating system calls (including standard I/O functions)

• recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

• All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

• INPORTS and OUTPORTS can be accessed by the HLL functions getstream(name, value) and
putstream(name, value) respectively.

5.2 Mapping of High-Level Language Constructs

This method converts a HLL program to a control/data-flow graph (CDFG) consisting of the RDFP
functions defined in Section 3.1. Before the processing starts, all HLL program arrays are mapped to
RDFP RAM functions. An array x is mapped to RAM RAM(x). If several arrays are mapped to the
same RAM, an offset is assigned, too. The RAMs are added to an initially empty CDFG. There must be
enough RAMs of sufficient size for all program arrays.
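
One way to picture the initial array mapping is a first-fit assignment of arrays to RAMs. The text does not prescribe a placement strategy, so first fit, the function name `map_arrays` and the sizes used below are assumptions for illustration only.

```python
def map_arrays(arrays, ram_sizes):
    """Assign every array a (ram_index, offset) pair using first fit.

    arrays maps array name -> number of elements, ram_sizes lists the RAM capacities.
    """
    used = [0] * len(ram_sizes)
    mapping = {}
    for name, length in arrays.items():
        for ram, size in enumerate(ram_sizes):
            if used[ram] + length <= size:
                mapping[name] = (ram, used[ram])   # the offset is the space already taken
                used[ram] += length
                break
        else:
            raise ValueError(f"no RAM of sufficient size for array {name}")
    return mapping


mapping = map_arrays({"x": 100, "y": 200, "z": 300}, ram_sizes=[256, 512])
```

Here y and z share the second RAM, so z receives the offset 200. An access z[i] would then use RAM address 200 + i.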

The CDFG is generated by a traversal of the AST of the HLL program. The following two pieces of
information are maintained at every program point⁴ during the traversal:

⁴In a program, program points are between two statements or before the beginning or after the end of a program structure
like a loop or a conditional statement.




• START points to an event output of an object. It delivers a 0-event whenever the program execution
at this program point starts. At the beginning, a 0-CONSTANT preloaded with an event input is
added to the CDFG. (It delivers a 0-event immediately after configuration.) START initially points
to its output. The STOP signal generated after a program part has finished executing is used as
new START signal for the next program part or signals termination of the entire program.

• VARLIST is a list of {variable, object-output} pairs. The pairs map integer variables (no arrays)
to a CDFG object's output. The first pair for a variable in VARLIST contains the output of the
object which produces the value of this variable valid at the program point. New pairs are always
added to the front of VARLIST. The expression VARDEF(var) refers to the object-output of
the first pair with variable var in VARLIST.⁵

The following subsections systematically list all HLL components and describe how they are processed,
thereby altering the CDFG, START and VARLIST.

5.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer:⁶

a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:

b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i. e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created.⁷ Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

⁵This method of using a VARLIST is adapted from the Transmogrifier C compiler [3].
⁶Note that we use C syntax for the following examples.
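
The handling of VARLIST for integer assignments can be sketched in a few lines. This is an illustration only: the tuple encoding of expression trees, the `CDFG` class and the output names such as `ALU1.X` are assumptions, whereas the text labels outputs by Roman numbers.

```python
import itertools


class CDFG:
    """Holds the ALUs created so far as (opcode, input_a, input_b, output_name) tuples."""

    def __init__(self):
        self.alus = []
        self._ids = itertools.count(1)

    def add_alu(self, opcode, a, b):
        out = f"ALU{next(self._ids)}.X"    # hypothetical name of the ALU's data output
        self.alus.append((opcode, a, b, out))
        return out


def vardef(varlist, var):
    """VARDEF(var): object-output of the first pair for `var` in VARLIST."""
    for name, out in varlist:
        if name == var:
            return out
    raise KeyError(var)


def result(expr, varlist, cdfg):
    """Map an expression tree to ALUs and return the output (or constant) holding its value."""
    if isinstance(expr, int):
        return expr                        # constant pseudo-object, no CDFG object needed
    if isinstance(expr, str):
        return vardef(varlist, expr)       # variable leaf: look up its current definition
    opcode, lhs, rhs = expr                # operator node: one ALU per operator
    return cdfg.add_alu(opcode, result(lhs, varlist, cdfg), result(rhs, varlist, cdfg))


def assign(lhs, expr, varlist, cdfg):
    varlist.insert(0, (lhs, result(expr, varlist, cdfg)))   # new pairs go to the front


cdfg, varlist = CDFG(), []
assign("a", 5, varlist, cdfg)                          # a = 5;
assign("b", ("+", ("*", "a", 2), 3), varlist, cdfg)    # b = a * 2 + 3;
```

After both assignments the list reads [("b", "ALU2.X"), ("a", 5)], and the multiplier ALU has the constant 5 connected directly to its first input.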

5.2.2 Conditional Integer Assignments

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.

Consider the following example:

i = 7;
a = 3;
if (i < 10) {
    a = 5;
    c = 7;
}
else {
    c = a - 1;
    d = 0;
}

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c,
III}, {a, IV}, {a, 3}, {i, 7}].

Note that case- or switch-statements can be processed, too, since they can, without loss of generality,
be converted to nested if-then-else statements.

This processing of conditional statements doesn't need explicit control, either. Both branches are
executed in parallel and synchronized by the data-flow.
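
The combination phase can be sketched on plain lists of pairs. The function names, the tuple used to represent a MUX and the string "undefined" are assumptions for this illustration. The sketch also assumes that the condition delivers a 1-event when it is true, so that the then-branch value feeds MUX input B (selected by a 1-packet) and the else-branch value feeds input A.

```python
def vardef(varlist, var, default="undefined"):
    """First definition of `var` in the list, or the special undefined value."""
    return next((out for name, out in varlist if name == var), default)


def combine(varlist, varlist1, varlist2, cond):
    """Create one MUX per variable assigned in either branch and extend VARLIST."""
    n = len(varlist)
    # Pairs added by a branch sit in front of the copied original list.
    added1 = varlist1[:len(varlist1) - n]
    added2 = varlist2[:len(varlist2) - n]
    assigned = []
    for name, _ in added1[::-1] + added2[::-1]:   # oldest assignment first
        if name not in assigned:
            assigned.append(name)
    combined = list(varlist)
    for name in assigned:
        # ("MUX", input A = else value, input B = then value, SEL = condition event)
        mux = ("MUX", vardef(varlist2, name), vardef(varlist1, name), cond)
        combined.insert(0, (name, mux))
    return combined


varlist = [("a", 3), ("i", 7)]
varlist1 = [("c", 7), ("a", 5)] + varlist     # then branch
varlist2 = [("d", 0), ("c", "I")] + varlist   # else branch, "I" is the ALU output for c
combined = combine(varlist, varlist1, varlist2, cond="cond")
```

For the example above this yields new front entries for d, c and a in that order. Variable d is not defined in the then branch or before the statement, so its MUX receives the undefined value on input B.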

⁷Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the compiler
frontend would normally have substituted the second assignment by b = 13 (constant propagation). For the simplicity of this
explanation, no frontend optimizations are considered in this and the following examples.
5.2.3 Array Accesses

In contrast to the above sections, array accesses have to be controlled explicitly to maintain the correct
execution order. For a read access, the read address is connected to data input RD. For a write access,
the write address is connected to data input WR and the write value to input IN. All these inputs are
connected to their respective sources through a GATE controlled by START. A STOP event signalling
completion of the array access must be assigned to the START variable. Since there is only one START
event packet available, only one array access can occur at a time, and the execution order of the original
program is maintained. This scheduling scheme is similar to a one-hot controller for digital hardware.

If a RAM is read and written at only one program point, the ERD or EWR outputs can be used as STOP
events. However, if several read or several write accesses (from different program points) to the same
RAM occur, each access produces an ERD or EWR event, respectively. But a STOP event should only be
generated for the program point currently executed, i. e. the current access. This is achieved by connecting
the START signals (i. e. those connected to the GATEs) of all other accesses with the inverted START
signal of the current access. The resulting signal produces an event for every access, but only for the
current access a 1-event. This event is combined (ECOMB) with the RAM's ERD or EWR access. The
ECOMB's output will only occur after the access is completed. Because ECOMB OR-combines its
event packets, only the current access produces a 1-event. Next, this event is filtered with a 1-FILTER
and changed by a 0-CONSTANT, resulting in a STOP signal which produces a 0-event only after the
current access is completed as required. See below for an example.
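
The event logic that derives a STOP signal per access point can be traced with a short sketch. It only models event values, not timing, and the function name and the labels for the access points are assumptions.

```python
def stop_events(access_order, me):
    """Events on the STOP signal of access point `me` for a run of accesses to one RAM."""
    stops = []
    for current in access_order:
        start = 0                                    # START always delivers a 0-event
        # N-to-1 connection: own START inverted, START of other accesses unchanged
        select = 1 - start if current == me else start
        completion = 0                               # ERD/EWR always generate 0-events
        combined = select | completion               # ECOMB: logical OR, after the access
        if combined == 1:                            # 1-FILTER discards the 0-events
            stops.append(0)                          # 0-CONSTANT turns the 1-event into 0
    return stops


order = ["read1", "read2", "read1"]   # hypothetical sequence of executed program points
stops_read1 = stop_events(order, "read1")
stops_read2 = stop_events(order, "read2")
```

Each access point therefore sees exactly one 0-event on its STOP signal per execution of that point, although the RAM raises a completion event for every access.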

For computing the RAM addresses, the compiler frontend's standard transformation for array accesses
can be used. The only difference is that the offset with respect to the RDFP RAM (as determined in the
initial mapping phase) must be used.

For several accesses, several sources can be connected to the RD, WR and IN inputs of a RAM. This
disables the self-synchronization. However, since only one access at a time can happen, the GATEs only
allow one data packet to arrive at the inputs.

For read accesses, the packets at the OUT output face the same problem as the ERD event packets: They
occur for every read access, but must only be used (and forwarded to subsequent operators) for the current
access. This can be achieved by connecting the OUT output via a DEMUX function. The Y output of
the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which only
forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a 0-event.

The signal created by the ECOMB described above for the STOP signal creates a 1-event for the current
access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired functionality.

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.
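
Registering array reads in VARLIST and removing them again might look as follows. The sketch assumes, for simplicity, that an array index consists of a single variable and represents an array element as an (array, index variable) tuple. Both choices, the function names and the output names are illustrative only.

```python
def register_read(varlist, array, index_var, output):
    """Remember that `output` holds the value of array[index_var]."""
    varlist.insert(0, ((array, index_var), output))


def invalidate(varlist, changed_var):
    """Remove registered array elements whose index uses the changed variable."""
    varlist[:] = [(key, out) for key, out in varlist
                  if not (isinstance(key, tuple) and key[1] == changed_var)]


varlist = [("i", "CNT.X"), ("j", 4)]
register_read(varlist, "a", "i", "DEMUX1.Y")   # x = a[i];
register_read(varlist, "a", "j", "DEMUX2.Y")   # y = a[j];
invalidate(varlist, "i")                       # i is redefined afterwards
```

After the invalidation only the entry for a[j] remains, so a later read of a[i] creates a new RAM access.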

The following example shows two read accesses:

x = a[i];
y = a[j];
z = x + y;

Fig. 4 shows the resulting CDFG. Inputs START (old), i and j should be substituted by the actual
functions resulting from the program before the array reads. The signal indicating the STOP of the first access
is marked by STOP1. Write accesses use the same control events, but instead of one GATE per access
for the RD inputs, one GATE for WR and one GATE for IN (with the same E input) are used. Also no
outputs need to be handled.

Fig. 5 shows the access a[i] = x; for the simple case that the RAM is only written once, i. e. at one
program point.

This scheme executes RAM accesses correctly, but not very fast since all accesses are synchronized even
if this is not necessary. The following optimizations are possible:

• Only accesses to the same RAM are synchronized. Accesses to different arrays can occur
concurrently or even in changed order. When there is a data dependency, the accesses self-synchronize
automatically. This can be achieved by maintaining a separate START signal for every RAM. At
the end of a basic block [1], all these START signals must be combined by an ECOMB to provide
a new START signal for the next basic block.

• For sequences of either read accesses or write accesses (not mixed) within a basic block, it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For
this purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order
dictated by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that
the inputs are only forwarded in this order. Then only the first access in the sequence needs to
be controlled by GATEs; the other GATEs can be removed to increase throughput. Similarly, the
OUT outputs of a read access can be distributed more efficiently for a sequence. A combination
of DEMUX functions with the same ESEQ control can be used. For read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing
a STOP event.

Fig. 6 shows the following three array reads in the optimized fashion.

x = a[i];
y = a[j];
z = a[k];
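
A possible arrangement for streaming these three reads is a chain of two MERGEs on the address side and two DEMUXes on the result side. The sketch below only illustrates the idea: Fig. 6 is not reproduced here, so the chain structure, the ESEQ sequences "01" and "001", the RAM contents and the function names are assumptions.

```python
from collections import deque


def merge(a, b, sel):
    """MERGE: forward A on a 0-event, B on a 1-event; the other packet stays at its input."""
    out = deque()
    while sel and (b if sel[0] else a):
        src = b if sel.popleft() else a
        out.append(src.popleft())
    return out


def demux(a, sel):
    """DEMUX: route A to output X on a 0-event and to output Y on a 1-event."""
    x, y = deque(), deque()
    while a and sel:
        (y if sel.popleft() else x).append(a.popleft())
    return x, y


ram = [10, 20, 30, 40]                         # contents of array a (hypothetical)
i, j, k = deque([2]), deque([0]), deque([3])   # address packets, one per read

# Address side: the ESEQ-controlled MERGEs forward i, then j, then k to input RD.
i_then_j = merge(i, j, deque([0, 1]))          # ESEQ "01"
rd = merge(i_then_j, k, deque([0, 0, 1]))      # ESEQ "001"

out = deque(ram[addr] for addr in rd)          # OUT packets appear in the same order

# Result side: DEMUXes with the same sequences distribute the values to x, y and z.
xy, z = demux(out, deque([0, 0, 1]))
x, y = demux(xy, deque([0, 1]))
```

Since the MERGEs fix the order of the addresses, no GATE is needed for the second and third read.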

5.2.4 Input and Output Ports

Input and output ports are processed similar to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.

5.2.5 General Conditional Statements

Conditional statements containing either array accesses or inner loops cannot be processed as described
in Section 5.2.2. Data packets must only be sent to the active branch. Therefore, a dataflow analysis is
required. The sets of variables used and defined in both branches must be computed. For all variables in
either of these sets, DEMUX functions controlled by the IF condition are inserted. They forward data
packets only to the selected branch. The then branch is processed with VARLIST1, and the else branch with
VARLIST2. Finally, the output values are combined. Since only one branch is ever activated, there
will not be a conflict due to two packets arriving simultaneously. The combinations will be added to
VARLIST after the conditional statement.

5.2.6 Loops

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
(INC) expressions are evaluated like any expressions (see Sections 5.2.1 and 5.2.3) and connected to the
respective inputs. The START input is connected to the START signal. The new START signal (after
loop execution) is CNT's W output sent through a 1-FILTER and 0-CONSTANT. (W only outputs a
1-event after the counter has terminated.) CNT's V output produces one 0-event for each loop iteration and
is therefore used as START for the loop body. Finally, CNT's NEXT input is connected to the START
signal at the end of the body (i. e. its STOP signal). This assures that one iteration only starts after
the previous one has finished. CNT's X output provides the current value of the loop index variable. For
FOR loops, dataflow analysis is required, too.
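
The event sequences that CNT contributes to loop control can be listed for one complete run. This sketch assumes that UB is inclusive (as for a loop condition of the form i <= UB), that a NEXT event arrives after every iteration and that the loop runs at least once. The function name is illustrative.

```python
def cnt_events(lb, ub, inc):
    """Counter values and event sequences of CNT for one complete run."""
    x = list(range(lb, ub + 1, inc))   # values of the loop index at output X
    n = len(x)
    u = [0] * (n - 1) + [1]            # U: N-1 0-events and one 1-event
    v = [0] * n                        # V: one 0-event per iteration, START of the body
    w = [0] * n + [1]                  # W: the 1-event appears only after termination
    return x, u, v, w


x, u, v, w = cnt_events(0, 10, 1)
# New START after the loop: W sent through a 1-FILTER and a 0-CONSTANT
new_start = [0 for event in w if event == 1]
```

For this counter the loop body is started eleven times via V, and exactly one 0-event is passed on as START for the code following the loop.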

[...] of the input value (from VARLIST at loop entry) and the feedback value from the end of the loop
is created. [...] these signals is connected to a DEMUX which is controlled by CNT's W output. The [...]
body contains these DEMUX outputs. After loop termination, the input or feedback value is [...]
the output of the loop (1-event). The VARLIST at the end of the loop [...] outputs; inputs
not defined in the loop are taken from the input VARLIST.

The processing of the loop body requires some special consideration. Data packets from variables defined
outside the loop but only used inside the loop (not redefined) do not lead to the [...] feedback
signal as explained above. Therefore only one packet is available (unless it is [...] a constant), but it is
consumed in each loop iteration. This would stall the loop [...]

These methods allow arbitrarily nested loops and conditional statements to be processed.
Fig. 7 shows the generated CDFG for the following for loop.

a = b + c;
for (i=0; i<=10; i++) {
    a = a + 1;
    x[i] = k;
}

[...] generation is controlled by the counter anyway.
5.2.7 WHILE Loops

WHILE loops are processed similarly. The STOP signal (new START signal) is generated from the loop
condition, fed through a 0-FILTER. When the loop finishes, an additional signal (similar to the CNT's
W output) must be generated which controls the DEMUXes to generate an output.

5.2.8 Parallelization, Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality.
However, the program execution is unnecessarily sequentialized by the START signals. In many cases
this is too restrictive. Several optimizations are possible.

Independent loops (operating on different variables and arrays) need not be sequentialized. They can
use the same START signal, and operate independently. After execution, their STOP signals must be
combined by ECOMB, forming a new START signal for the subsequent program part.

In some cases, loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined
data-flow through the operators of the loop body [4]. This technique can be easily applied to the method
described here. For FOR loops, the CNT's NEXT input is removed so that CNT counts continuously,
thereby overlapping the loop iterations. Since vectorizable loops have no memory access conflicts, the
read and write accesses to the same RAM can also overlap. Especially for dual-ported RAM this leads
to considerable performance improvements. In this case separate START signals must not only be
maintained for each RAM, but also separately for read and write accesses.

Finally, loop transformations like loop unrolling, loop distribution, loop tiling or loop merging [4] can
be applied to increase the parallelism and improve performance.